We built 469 binary classifiers that read every sentence of every SEC filing. Not headlines. Not earnings calls. The footnotes on page 47 of the 10-Q that nobody opens. The language buried in those pages predicts which stocks are about to break out — weeks before the move.
We studied thousands of stocks that gained 50% or more in 21 trading days. The first finding: 86% had an SEC filing land within 30 days of the breakout. The catalyst was in the paperwork. The second finding was more surprising: the stocks that bounce aren't the ones with good news. They're turnaround stories — distressed companies where the distress is about to resolve.
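As a rough sketch of how the two screens above could be expressed, the snippet below flags a 50%+ gain inside a 21-trading-day window and checks whether a filing date falls within 30 days of the breakout. The function names, the use of closing prices, and the choice to count calendar days on either side of the breakout are assumptions for illustration, not a description of the actual pipeline.

```python
from datetime import date


def find_breakouts(closes, horizon=21, threshold=0.50):
    """Return (start, end) index pairs where the close rose by at least
    `threshold` within `horizon` trading days of `start`."""
    hits = []
    for start, base in enumerate(closes):
        window = closes[start + 1 : start + 1 + horizon]
        for end, price in enumerate(window, start=start + 1):
            if price / base - 1.0 >= threshold:
                hits.append((start, end))
                break  # keep only the first day the threshold is crossed
    return hits


def filing_near(breakout_day, filing_days, max_gap_days=30):
    """True if any filing date is within `max_gap_days` calendar days of
    the breakout (before or after; the direction is an assumption)."""
    return any(abs((breakout_day - f).days) <= max_gap_days for f in filing_days)


# Hypothetical prices and dates, for illustration only.
closes = [10.0, 10.5, 11.0, 16.0, 12.0]
breakouts = find_breakouts(closes)
has_filing = filing_near(date(2024, 3, 20), [date(2024, 3, 1), date(2023, 11, 15)])
```

Each start day is compared only against later closes inside its own window, so a slow drift that takes longer than `horizon` days to reach the threshold is not counted.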
Stocks that pop 50%+ are already-distressed companies whose distress signals spike above their own historical baseline before a catalyst fires. The catalysts that appear more often among movers than among non-movers: management departures (1.55x), leadership changes (1.36x), clinical trial milestones (1.30x). The market over-penalizes the distress. The filing tells you the spring is loaded.
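Two quantities carry this argument: how far a distress signal sits above the company's own baseline, and how much more often a catalyst shows up among movers than non-movers. A minimal sketch of both follows. The z-score formulation, the function names, and all input numbers are assumptions chosen for illustration (the counts are made up so the ratio lands on 1.55x).

```python
from statistics import mean, stdev


def baseline_zscore(history, current):
    """How many standard deviations `current` sits above the company's
    own past fire rates for a distress classifier."""
    sigma = stdev(history)
    if sigma == 0:
        return 0.0
    return (current - mean(history)) / sigma


def lift(movers_hit, movers_total, others_hit, others_total):
    """Ratio of how often a signal appears among movers vs. non-movers."""
    return (movers_hit / movers_total) / (others_hit / others_total)


# Hypothetical fire rates from four past filings, then the latest one.
z = baseline_zscore([0.02, 0.03, 0.02, 0.03], current=0.08)

# Hypothetical counts: 31 of 100 movers vs. 20 of 100 non-movers.
ratio = lift(31, 100, 20, 100)
```

Measuring against the company's own history rather than a market-wide average is what lets a normally quiet filer stand out when its language changes.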
Movers have less news coverage than non-movers. Less analyst attention. Fewer Benzinga articles. Every channel that indicates "the market is watching" is anti-correlated with big moves. The edge is in the blind spots: companies too small or too beaten-down for anyone to read their filings.
Companies with "growth" profiles (expanding internationally, gaining subscribers, growing market share) carry a negative signal for big moves. The market already paid for growth. What it hasn't paid for: transformation. Governance upheavals, activist campaigns, corporate restructuring. The boring operational changes that precede repricing.
A distressed biotech with a Phase 3 readout coming is fundamentally different from a distressed restaurant chain with declining same-store sales. Both show distress in their filings. Only one has a binary catalyst that can reprice the stock overnight. The classifier profile tells you which.
A binary classifier answers one question about one sentence: "Does this sentence describe X?" Not keyword matching — semantic understanding. "Substantial doubt about our ability to continue as a going concern" and "significant uncertainty regarding future operations" trigger the same classifier, even though they share almost no words.
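To make the "one question, one sentence" idea concrete, here is a minimal sketch of a classifier that scores a sentence embedding rather than its words. The article does not say what kind of model sits on top of the embeddings, so the logistic (linear) head is an assumption, and the 3-dimensional vectors, weights, and names are hand-made stand-ins for real embeddings.

```python
import math


def fires(embedding, weights, bias, threshold=0.5):
    """Return True if the classifier's probability for this sentence
    embedding reaches the threshold."""
    score = sum(w * x for w, x in zip(weights, embedding)) + bias
    probability = 1.0 / (1.0 + math.exp(-score))
    return probability >= threshold


# Hypothetical "going concern" classifier parameters.
weights, bias = [4.0, -1.0, 0.0], -2.0

# Toy embeddings: the first two sentences share almost no words but sit
# close together in the vector space; the third is about something else.
going_concern = [0.9, 0.1, 0.3]       # "substantial doubt ... going concern"
future_uncertainty = [0.8, 0.2, 0.5]  # "significant uncertainty ... operations"
dividend_notice = [0.1, 0.9, 0.2]     # unrelated sentence

results = [fires(e, weights, bias)
           for e in (going_concern, future_uncertainty, dividend_notice)]
```

Because the decision depends only on where a sentence lands in the vector space, two differently worded sentences trigger the same classifier as long as their embeddings are close.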
We started with 159 classifiers designed by human analysts — obvious concepts like "going concern warning" and "clinical trial result." Then we asked: what concepts exist in the filing language that humans haven't defined?
We embedded 150 million SEC sentences into a 768-dimensional vector space and ran HDBSCAN clustering. 551 natural groupings emerged — sentences that mean similar things, grouped by the geometry of language rather than human hypothesis.
Each cluster became a classifier candidate. 370 passed quality gates (separability, fire rate, coherence). They detect concepts no analyst defined: "restrictive debt covenant disclosure," "mineral exploration delineation," "CEO team pride statement," "merger acquisition synergy benefit claim."
We ran a coherence audit: do each classifier's positive sentences actually cluster tightly in embedding space? 62 human-designed classifiers failed — they were firing on loosely related vocabulary rather than a tight concept. Removing them improved prediction quality by 48%.
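One plausible way to run such an audit is to score each classifier by the mean pairwise cosine similarity of its positive sentences. The sketch below assumes that definition. The `min_coherence` cutoff and the 2-dimensional toy vectors are hypothetical, since the article does not state the metric or threshold it used.

```python
import math
from itertools import combinations


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def coherence(positives):
    """Mean pairwise cosine similarity of a classifier's positive sentences."""
    pairs = list(combinations(positives, 2))
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)


def passes_coherence(positives, min_coherence=0.5):
    # min_coherence is a hypothetical cutoff; the source does not state one.
    return coherence(positives) >= min_coherence


# Toy embeddings: one tight concept, one loose bag of vocabulary.
tight = [[1.0, 0.1], [1.0, 0.2], [0.9, 0.1]]
loose = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.2]]
```

A classifier that fires on loosely related vocabulary scores low here because its positives point in different directions, even if each one individually looked like a reasonable match.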
Meanwhile, 4 of the top 5 most predictive classifiers in the final model were AI-discovered, not human-designed. The unsupervised method found concepts that humans missed — and those concepts turned out to be among the strongest signals.
SIC codes classify companies by what they sell. We classify companies by what they say. Clustering 3,000 firms by their 469-classifier profile reveals natural archetypes that predict behavior better than any industry code.
Incorporation disclosures, pro forma statements, warrant accounting. Recently listed, pre-revenue, complex capital structures. These are the companies most likely to experience dramatic repricing events.
Convertible debt conversions, Black-Scholes warrant valuation, institutional positioning. Companies in active financial restructuring — too distressed for most investors, but actively transitioning in ways that create binary outcomes.
Board reshuffles, activist campaigns, litigation driving structural change. The highest single-classifier signal in the entire model. Internal fighting is the catalyst — activists force asset sales, new boards bring new strategy.
Clinical trial results, FDA interactions, indication expansions. Binary catalysts: the trial succeeds or it doesn't. The filing language reveals where in the pipeline the company sits and how close the binary event is.
Loan loss provisions, net interest margin disclosures, deposit growth comparisons. These companies almost never experience dramatic positive moves. The filing language is all incremental accounting methodology — nothing says "something dramatic is about to happen."
When two companies have near-identical classifier profiles — the same classifiers fire, the same ones don't — they're "structural twins." Same business dynamics, same risk factors, same catalyst pipeline. Not because we told the system they're similar. Because their filings say the same things.
Two gold mining companies with 92.3% classifier similarity. Same exploration language, same production disclosures, same risk factors. When one announces a resource estimate upgrade, the other is structurally positioned for the same catalyst.
A group of 5 biotech companies that keep appearing as each other's twins. Same clinical trial language, same FDA interaction patterns, same capital structure disclosures. When one advances, the others have the same pipeline.
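A simple way to picture twin detection is to treat each company as a vector of fire / no-fire flags and measure how often two vectors agree. The sketch below assumes that simple-matching measure. The article does not say how its similarity percentage is computed, and the company names, 10-classifier profiles, and 0.9 cutoff are hypothetical.

```python
def profile_agreement(a, b):
    """Fraction of classifiers on which two companies agree (both fire
    or both stay silent)."""
    if len(a) != len(b):
        raise ValueError("profiles must cover the same classifiers")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def find_twins(profiles, min_agreement=0.9):
    """Return (name, name, agreement) for every pair at or above the cutoff."""
    names = sorted(profiles)
    twins = []
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            score = profile_agreement(profiles[first], profiles[second])
            if score >= min_agreement:
                twins.append((first, second, score))
    return twins


# Hypothetical profiles: 1 means the classifier fires for that company.
profiles = {
    "MINER_A": [1, 1, 0, 0, 1, 0, 0, 1, 0, 0],
    "MINER_B": [1, 1, 0, 0, 1, 0, 0, 1, 0, 1],
    "BANK_C":  [0, 0, 1, 1, 0, 1, 1, 0, 1, 0],
}
twins = find_twins(profiles)
```

Nothing in the comparison uses an industry label. The two miners pair up only because the same classifiers fire for both.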
E5-base-v2, 768 dimensions. Every sentence is mapped to a point in semantic space where meaning, not vocabulary, determines proximity. "CEO was fired" and "CEO was hired" land in different neighborhoods. The previous 384-dim model couldn't separate them.
Every classifier must pass six gates, including training accuracy (AUC >= 0.85), concept separation (d' >= 2.0), fire rate control (< 1% for AI-discovered classifiers), and coherence (positives must cluster in embedding space). 62 classifiers were removed for failing coherence. Quality over quantity.
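The gates with stated thresholds can be sketched as a single check. The AUC, d', and fire-rate cutoffs come from the text above. The coherence cutoff (`min_coherence`), the function names, and the toy scores are hypothetical, and the d' formula is the standard sensitivity index, which may differ from the exact definition used in the pipeline.

```python
import math
from statistics import mean, variance


def auc(pos_scores, neg_scores):
    """Probability that a random positive outscores a random negative
    (ties count as half)."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))


def d_prime(pos_scores, neg_scores):
    """Sensitivity index: gap between class means in pooled-std units."""
    pooled_std = math.sqrt((variance(pos_scores) + variance(neg_scores)) / 2)
    return (mean(pos_scores) - mean(neg_scores)) / pooled_std


def check_gates(pos_scores, neg_scores, fire_rate, coherence, min_coherence=0.5):
    """Return a pass/fail flag per gate for one classifier."""
    return {
        "auc": auc(pos_scores, neg_scores) >= 0.85,
        "d_prime": d_prime(pos_scores, neg_scores) >= 2.0,
        "fire_rate": fire_rate < 0.01,          # stated for AI-discovered classifiers
        "coherence": coherence >= min_coherence,  # hypothetical cutoff
    }


# Hypothetical classifier scores on labeled positive and negative sentences.
pos = [0.90, 0.80, 0.85, 0.95]
neg = [0.10, 0.20, 0.15, 0.30]
gates = check_gates(pos, neg, fire_rate=0.004, coherence=0.8)
```

Returning one flag per gate, rather than a single boolean, makes it easy to see which gate rejected a candidate.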
Classifier profiles update with every new filing. A company's archetype can shift as its situation changes — a healthy biotech entering a funding crisis transitions from one archetype to another as its filing language shifts. The classification tracks reality in real time.
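The archetype shift described here can be illustrated with a small sketch: blend each new filing into the company's running profile, then reassign the company to the nearest archetype centroid. The blending rule, the Euclidean nearest-centroid assignment, the two classifier names, and the archetype labels are all assumptions made for illustration.

```python
import math


def update_profile(profile, new_filing_rates, weight=0.5):
    """Blend the latest filing's fire rates into the running profile.
    `weight` controls how quickly old filings fade (an assumed rule)."""
    return {name: (1 - weight) * profile[name] + weight * new_filing_rates[name]
            for name in profile}


def nearest_archetype(profile, centroids):
    """Return the archetype whose centroid is closest to the profile."""
    def distance(centroid):
        return math.sqrt(sum((profile[k] - centroid[k]) ** 2 for k in profile))
    return min(centroids, key=lambda name: distance(centroids[name]))


# Hypothetical archetype centroids over two hypothetical classifiers.
centroids = {
    "clinical_stage": {"trial_milestone": 0.8, "going_concern": 0.1},
    "restructuring":  {"trial_milestone": 0.2, "going_concern": 0.9},
}

# A healthy biotech, then the same company after a filing full of
# funding-crisis language.
before = {"trial_milestone": 0.7, "going_concern": 0.1}
after = update_profile(before, {"trial_milestone": 0.1, "going_concern": 0.95})

archetype_before = nearest_archetype(before, centroids)
archetype_after = nearest_archetype(after, centroids)
```

With these toy numbers a single distressed filing is enough to move the company across the boundary. A smaller `weight` would require the new language to persist over several filings first.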