From the Lab
Deep-dive research reports and editorial from the Provenance team — every thesis traceable to the exact filing sentence that triggered it.
Why raw data is a commodity and signal memory is not — and why Nike's nineteenth consecutive quarter of operating margin pressure is not in Bloomberg.
We embedded 150 million sentences from SEC filings into 768-dimensional space. HDBSCAN found 551 clusters. One stopped us cold: 3,282 sentences about nothing happening.
151 million sentences. 924,000 filings. 472 classifiers — 159 hand-designed, 313 discovered by machines. Here is what a decade of reading every SEC filing taught us about predicting stock moves: why AUC lies, why trajectories beat events, and why 4 of our 5 most predictive signals were concepts nobody thought to look for.
The SIC system was designed in 1937. GICS in 1999. Both classify companies by what they sell. We classified 3,000 companies by what they say — 469 binary classifiers across 150 million SEC sentences — and found clusters that predict stock behavior better than any existing taxonomy.
Identify companies with multiple distress signals firing simultaneously. When revenue declines and costs get cut — pay attention.
Velocity analysis reveals turning points that linear regression misses. The first derivative tells the story.
How we extract signals from SEC filings and visualize corporate health trajectories across decades.
Quality biotech oscillators: 82% win rate, 75-day median hold, +62.5% average return per wave.
14 classifiers detect financial distress. COVID shock at -3.5σ, recovery signals tracked across 800+ companies.
9 biotech classifiers, 10 years of filings. Strongest signal in company history.
22 classifiers, 4.2M sentences. COVID shock at -4.7σ, commodity boom at +3.3σ.
29 classifiers, 1.2M sentences. COVID breakdown at -2.4σ, now recovering toward pre-pandemic baseline.
Three diverging paths across IT services, staffing, and consulting as AI reshapes demand signals from 300+ companies over 20 years of filings.
Get Access
Request access to the full classifier dataset — every signal, every filing, every sentence.