What 151 Million Sentences Taught Us About Predicting Stock Moves

Every quarter, roughly 8,000 public companies file documents with the SEC. A typical 10-Q runs 60 to 120 pages. A 10-K can hit 300. Multiply that across two decades and you get a corpus that nobody — no analyst, no fund, no team of interns — has ever read in its entirety.

We did. Or rather, our systems did. 151 million sentences across 924,000 filings. Every quarterly report, every annual filing, every company from the largest mega-cap to the smallest micro-cap that most investors have never heard of. This is what we learned.

151M

Sentences

924K

Filings

472

Classifiers

Archetypes

Reading Every Sentence

The thesis was simple: SEC filings contain information that moves stock prices, and most of that information goes unread.

Consider a $200 million company with zero analyst coverage. It files a 120-page 10-Q every quarter. Nobody reads it. Nobody summarizes it. Nobody flags the new paragraph on page 47 where management discloses "substantial doubt about the entity's ability to continue as a going concern." The information sits on EDGAR, technically public, practically invisible.

We built 472 binary classifiers. Each one answers exactly one question about a single sentence. "Does this describe a going concern warning?" "Is this a clinical trial result?" "Does this mention a management departure?" One sentence, one question, one answer.

This is not keyword matching. The classifiers operate on 768-dimensional semantic embeddings — dense numerical representations of meaning. The sentences "substantial doubt about ability to continue" and "significant uncertainty regarding future operations" share almost no words but carry the same meaning. Both trigger is_going_concern_warning. Meanwhile, "we will continue to invest in growth" shares the word "continue" but carries no distress signal at all. The embedding space captures this distinction cleanly.

The 472 classifiers come from two sources — and the split taught us something we didn't expect.

159 were hand-designed. These came from investment theses — a portfolio manager says "I want to know when companies start disclosing supply chain problems," and we build a classifier for exactly that. Targeted, precise, and limited by human imagination.

313 were discovered by machines. We embedded millions of sentences and ran unsupervised clustering (HDBSCAN) on the embedding space. It found 551 natural groupings — concepts nobody hypothesized. After filtering for quality, 313 survived as production classifiers.

4 of the top 5 most predictive classifiers were machine-discovered. The concepts a human would think to look for are often the concepts the market has already priced. The unexploited signal lives in categories nobody knew existed.

Learning to Distrust AUC

Early on, we evaluated classifiers the standard way: AUC, precision, recall, F1. A classifier with 0.98 AUC looks like a home run.

One of our highest-AUC classifiers scored "hired 500 employees" as a layoff. It had learned to detect workforce change sentences and performed beautifully on the held-out test set. In practice, it was confusing opposite events.

AUC measures discrimination — can the model separate positive from negative examples? — but it says nothing about whether the model understands what it's separating. So we built three gates that every classifier must pass before entering production.

Gate 1

Separability

Can the classifier distinguish its target from closely related concepts — not just from random sentences?

Gate 2

Fire Rate

Does it trigger in a useful range — not so broad it catches everything, not so rare it can't build temporal patterns?

Gate 3

Coherence

Do its high-confidence positives cluster tightly in embedding space — one coherent concept, not scattered neighborhoods?

We removed 62 classifiers that passed AUC thresholds but failed the coherence gate. Downstream predictive performance improved by 48%. Removing bad classifiers helped more than adding good ones. This was a recurring theme.

From Events to Trajectories

A single classifier firing once tells you almost nothing. Companies copy-paste 80% or more of their filings quarter to quarter. If a company mentioned going concern risk last quarter and mentions it again this quarter, is that news? Usually not. It's the same paragraph, carried forward.

The signal is not in the events. The signal is in the trajectories.

Is the distress language new, or has it been building for three quarters? Did a catalyst just appear alongside persistent distress? Did the company stop mentioning a risk it had disclosed for two years — and does that silence mean the risk resolved, or that someone decided to stop talking about it?

We built temporal features to capture these patterns. Echo decay tracks how classifier signals fade over time — a signal from last quarter matters more than one from two years ago, but it still matters. Fire rate trending measures whether a classifier is firing more or less than usual for a specific company, expressed as z-scores. Persistence models track Markov transition probabilities between states: if a company is in distress this quarter, what's the probability it transitions to recovery next quarter?

The temporal features, not the classifiers themselves, turned out to carry the most predictive weight. The classifiers are the vocabulary — the words to describe what's happening. The temporal features are the grammar — they tell us whether what's happening is new, escalating, persistent, or resolving.

What We Layered On Top

SEC filings are the core. But they are quarterly snapshots. Between filings, other signals fill the gaps.

RSS feeds. We mined 4 million RSS URLs from Common Crawl and curated them to 70,000 active feeds across 20-plus languages. 148 of the first 159 clusters were recipe blogs and astrology sites. But buried in the rest: European financial news sources that consistently led US sector rotations by one to four trading days. The alpha wasn't in obscurity — it was in geography.

Insider transactions. We processed 6.1 million transactions from SEC Form 4 filings. Most insider transactions are noise: scheduled sales, option exercises, routine diversification. But when multiple insiders buy the same stock on the same day — what we call "cluster buys" — the signal gets loud. In March 2020, as markets crashed, insider purchases tripled.

8-K event codes. Structured data hiding in plain sight. Every 8-K carries event codes — a change in auditors, a delisting notice, a material agreement. We processed 904,000 filings containing 1.17 million events. Delisting notices have increased 29-fold since 2004. Among SPAC-era IPOs, 57% have received at least one.

Credit spreads. The bond market prices distress before the equity market does. Adding credit spread data from FINRA TRACE was worth more to our models than all news-derived classifiers combined. When a company's borrowing costs spike, the bond market is telling you something the stock price hasn't absorbed yet.

Each source adds a dimension the others can't see. But — and this was a hard lesson — more data does not automatically mean better predictions. Every time we dumped raw features from a new source into our models, performance degraded. Raw data is just noise with potential. The work of turning it into signal is where most of the effort goes, and most of the value is created.

What Emerged

When we clustered roughly 3,000 companies by their classifier profiles — their 472-dimensional fingerprints — 10 natural archetypes emerged. These are not industries. They are situations.

SPAC / IPO Shells

35.7%

21-day mover rate

Distressed Micro-Caps

26.2%

21-day mover rate

Governance Upheaval

18.9%

21-day mover rate

Banks & Lenders

0.9%

21-day mover rate

That is a 40-fold spread between the most and least volatile archetypes, derived entirely from the language in their SEC filings.

Within each archetype, we find what we call classifier twins — pairs of companies whose profiles are 92% or more similar. They file similar language, face similar risks, and sit in similar positions on similar catalyst timelines. When one twin makes a big move, the other often follows — because the underlying situation is the same, even if the market hasn't noticed the connection.

This classification — emergent from language, not assigned by committee — predicts behavior better than SIC codes (designed 1937) or GICS (1999). Both systems classify companies by what they sell. Our profiles classify companies by what they're experiencing. The filing language captures both the shared distress and the divergent catalysts. That distinction matters.

What's Coming

This post is a map. What follows is a series walking through each territory in detail.

50% Gains Start in Footnotes

86% of stocks that gained 50%+ in 21 days had an SEC filing within 30 days. Not earnings. Not tweets. A filing.

What Filings Say When They Say Nothing

3,282 sentences from hundreds of companies, across decades, all saying: nothing happened. The model learned negation as disclosure.

The SPAC Hangover

Delisting notices up 29x since 2004. Among 2020–2022 SPAC IPOs, 57% have received at least one.

When CEOs Buy the Dip

Insider purchases tripled in March 2020. 6.1M transactions processed. 35% of what screeners call "insider buying" is option exercises.

We built these tools to find signal in text. What we found was that the signal was always there — in the footnotes on page 47, in the going concern paragraph that appeared for the third straight quarter, in the filing that nobody opened because the company was too small for anyone to care. The classifiers just made it countable.