Live Research · March 2026

Reading what the market skips

We built 469 binary classifiers that read every sentence of every SEC filing. Not headlines. Not earnings calls. The footnotes on page 47 of the 10-Q that nobody opens. The language buried in those pages predicts which stocks are about to break out — weeks before the move.

469 binary classifiers

150 million sentences scored

3.5 million filings analyzed

20+ years of filing history

310 concepts discovered by AI

The Discovery

Not all distress is equal

We studied thousands of stocks that gained 50% or more in 21 trading days. The first finding: 86% had an SEC filing land within 30 days of the breakout. The catalyst was in the paperwork. The second finding was more surprising: the stocks that bounce aren't the ones with good news. They're turnaround stories — distressed companies where the distress is about to resolve.
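The screen described above can be sketched in a few lines. This is an illustration only, not the production pipeline: the function names, the skip-ahead rule after a hit, and the reading of "within 30 days" as an absolute gap in calendar days are all assumptions.

```python
from datetime import date

def find_breakouts(closes, window=21, threshold=0.50):
    """Return (start, end) index pairs where the close gained at least
    `threshold` within `window` trading days of the start index."""
    breakouts = []
    i = 0
    last = len(closes) - 1
    while i < last:
        base = closes[i]
        hit = None
        if base > 0:
            for j in range(i + 1, min(i + window, last) + 1):
                if closes[j] / base - 1.0 >= threshold:
                    hit = j
                    break
        if hit is None:
            i += 1
        else:
            breakouts.append((i, hit))
            i = hit  # assumption: skip ahead so one move is counted once
    return breakouts

def filing_near(breakout_date, filing_dates, max_gap_days=30):
    """True if any filing landed within `max_gap_days` of the breakout.
    Assumption: the gap is measured in calendar days, in either direction."""
    return any(abs((breakout_date - f).days) <= max_gap_days for f in filing_dates)
```

Running `find_breakouts` over each ticker's daily closes and then `filing_near` over its filing dates would give the share of breakouts with a nearby filing.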

The coiled spring

Stocks that pop 50%+ tend to be already-distressed companies whose distress signals spike above their own historical baseline before a catalyst fires. Several catalysts show up more often in movers' filings than in non-movers': management departures (1.55x), leadership changes (1.36x), clinical trial milestones (1.30x). The market over-penalizes the distress. The filing tells you the spring is loaded.
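Two quantities carry this argument: a spike measured against the company's own baseline, and a "how much more common" ratio between movers and non-movers. A minimal sketch of both follows. The z-score formulation, the `min_sigma` floor, and the function names are assumptions, not the model's actual definitions.

```python
from statistics import mean, pstdev

def baseline_zscore(history, current, min_sigma=1e-3):
    """Z-score of the latest fire rate against the company's own past filings.
    `history` holds fire rates from earlier filings, oldest first.
    Assumption: a small floor on sigma avoids division by zero on flat histories."""
    if len(history) < 2:
        return 0.0  # not enough history to define a baseline
    sigma = max(pstdev(history), min_sigma)
    return (current - mean(history)) / sigma

def lift(movers_with, movers_total, others_with, others_total):
    """Ratio of how often a signal appears among movers vs. non-movers.
    A value of 1.55 means the signal is 1.55x as common among movers."""
    mover_rate = movers_with / movers_total
    other_rate = others_with / others_total
    return mover_rate / other_rate
```

Comparing each company to itself rather than to a cross-sectional average is the point of the first function: a chronically distressed filer only registers when its language moves beyond its own norm.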

The neglect premium

Movers have less news coverage than non-movers. Less analyst attention. Fewer Benzinga articles. Every channel that indicates "the market is watching" is anti-correlated with big moves. The edge is in the blind spots: companies too small or too beaten-down for anyone to read their filings.

Growth is the anti-signal

Companies with "growth" profiles (expanding internationally, gaining subscribers, growing market share) carry a negative signal for big moves. The market already paid for growth. What it hasn't paid for: transformation. Governance upheavals, activist campaigns, corporate restructuring. The boring operational changes that precede repricing.

Binary catalysts, not slow grinds

A distressed biotech with a Phase 3 readout coming is fundamentally different from a distressed restaurant chain with declining same-store sales. Both show distress in their filings. Only one has a binary catalyst that can reprice the stock overnight. The classifier profile tells you which.

The Technology

469 questions, every sentence, every filing

A binary classifier answers one question about one sentence: "Does this sentence describe X?" Not keyword matching — semantic understanding. "Substantial doubt about our ability to continue as a going concern" and "significant uncertainty regarding future operations" trigger the same classifier, even though they share almost no words.

Filing Ingested

Every 10-K and 10-Q filed with the SEC. 3.5 million filings, 150 million sentences, updated as new filings arrive from EDGAR.

469 Classifiers Score Every Sentence

Each sentence passes through 469 binary classifiers. Going concern warning? Clinical trial result? Management departure? Debt covenant breach? 469 yes/no answers per sentence, aggregated per filing.
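The aggregation step can be pictured as turning a sentences-by-classifiers grid of yes/no answers into one rate per classifier. A minimal sketch, assuming the per-filing aggregate is a simple fire rate (the source does not specify the aggregation); `filing_fire_rates` is a hypothetical name.

```python
def filing_fire_rates(sentence_flags, n_classifiers):
    """Aggregate per-sentence yes/no outputs into one fire rate per classifier.
    `sentence_flags` has one row per sentence; each row holds `n_classifiers`
    values that are truthy when that classifier fired on the sentence."""
    if not sentence_flags:
        return [0.0] * n_classifiers
    totals = [0] * n_classifiers
    for row in sentence_flags:
        if len(row) != n_classifiers:
            raise ValueError("every sentence needs one flag per classifier")
        for k, flag in enumerate(row):
            if flag:
                totals[k] += 1
    n_sentences = len(sentence_flags)
    return [t / n_sentences for t in totals]
```

With 469 classifiers, the returned list is the filing's 469-value summary.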

Filing Profile Computed

Each company gets a 469-dimensional "fingerprint" based on which classifiers fire across its filing history. Not a snapshot — a trajectory. Is the distress language new or accumulating?
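One way to make "new or accumulating" concrete is to label each classifier's history across a company's filings. The labels and rules below are assumptions chosen for illustration, not the definitions used in the model.

```python
def trajectory_label(rates, min_rate=0.0):
    """Label one classifier's history for one company.
    `rates` holds the fire rate in each filing, oldest first.
    Returns 'absent', 'new', 'accumulating', or 'persistent'."""
    active = [r > min_rate for r in rates]
    if not rates or not active[-1]:
        return "absent"          # not firing in the latest filing
    if not any(active[:-1]):
        return "new"             # fires now, never fired before
    # Collect the most recent unbroken run of filings where it fired.
    run = []
    for rate, is_active in zip(reversed(rates), reversed(active)):
        if not is_active:
            break
        run.append(rate)
    run.reverse()
    if len(run) >= 2 and all(b > a for a, b in zip(run, run[1:])):
        return "accumulating"    # strictly rising over the recent run
    return "persistent"
```

Applying this to every classifier turns the fingerprint from a snapshot of rates into a description of direction.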

Company Scored and Ranked

The filing profile feeds into a ranking model that asks: given this company's classifier trajectory, how likely is a significant move in the next 21-30 days?

The classifiers don't read headlines or earnings call transcripts. They read the legal disclosures that companies are required to file: language written by lawyers, reviewed by auditors, and certified by the CEO and CFO under Sarbanes-Oxley. It's the most reliable text in finance, and almost nobody reads it at scale.

Unsupervised Discovery

310 concepts no human thought to look for

We started with 159 classifiers designed by human analysts — obvious concepts like "going concern warning" and "clinical trial result." Then we asked: what concepts exist in the filing language that humans haven't defined?

The method

We embedded 150 million SEC sentences into a 768-dimensional vector space and ran HDBSCAN clustering. 551 natural groupings emerged — sentences that mean similar things, grouped by the geometry of language rather than human hypothesis.

Each cluster became a classifier candidate. 370 passed quality gates (separability, fire rate, coherence). They detect concepts no analyst defined: "restrictive debt covenant disclosure," "mineral exploration delineation," "CEO team pride statement," "merger acquisition synergy benefit claim."

The validation

We ran a coherence audit: do each classifier's positive sentences actually cluster tightly in embedding space? 62 human-designed classifiers failed — they were firing on loosely related vocabulary rather than a tight concept. Removing them improved prediction quality by 48%.
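A coherence check of this kind can be sketched as the average pairwise cosine similarity among a classifier's positive sentences. The scoring rule, the 0.5 threshold, and the function names are assumptions; the source does not state how coherence was measured.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def coherence(positive_vectors):
    """Mean pairwise cosine similarity of a classifier's positive sentences."""
    pairs = list(combinations(positive_vectors, 2))
    if not pairs:
        return 1.0  # zero or one positive: trivially coherent
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)

def passes_coherence(positive_vectors, threshold=0.5):
    """Hypothetical gate: the threshold is illustrative, not from the source."""
    return coherence(positive_vectors) >= threshold
```

A classifier that fires on loosely related vocabulary scatters its positives across the space and scores low. At 150-million-sentence scale the all-pairs loop would need sampling, which this sketch leaves out.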

Meanwhile, 4 of the top 5 most predictive classifiers in the final model were AI-discovered, not human-designed. The unsupervised method found concepts that humans missed — and those concepts turned out to be among the strongest signals.

The traditional pipeline: human guesses a concept, labels data, trains classifier, hopes it works. Our pipeline: let the data find the concepts, validate coherence, let the model decide what's useful. The machine found 310 concepts worth looking for. Humans found 159. The machine's were better.

A New Classification

What a company says tells you what it is

SIC codes classify companies by what they sell. We classify companies by what they say. Clustering 3,000 firms by their 469-classifier profile reveals natural archetypes that predict behavior better than any industry code.

SPAC / IPO Shells

Incorporation disclosures, pro forma statements, warrant accounting. Recently listed, pre-revenue, complex capital structures. These are the companies most likely to experience dramatic repricing events.

Distressed Micro-Cap

Convertible debt conversions, Black-Scholes warrant valuation, institutional positioning. Companies in active financial restructuring — too distressed for most investors, but actively transitioning in ways that create binary outcomes.

Governance Upheaval

Board reshuffles, activist campaigns, litigation driving structural change. The highest single-classifier signal in the entire model. Internal fighting is the catalyst — activists force asset sales, new boards bring new strategy.

Biotech Clinical-Stage

Clinical trial results, FDA interactions, indication expansions. Binary catalysts: the trial succeeds or it doesn't. The filing language reveals where in the pipeline the company sits and how close the binary event is.

Banks & Lenders

Loan loss provisions, net interest margin disclosures, deposit growth comparisons. These companies almost never experience dramatic positive moves. The filing language is all incremental accounting methodology — nothing says "something dramatic is about to happen."

40x

The spread between the highest and lowest archetype is 40x — a company's classifier profile predicts its behavior more precisely than its industry classification.

Structural Similarity

Finding a company's filing twin

When two companies have near-identical classifier profiles — the same classifiers fire, the same ones don't — they're "structural twins." Same business dynamics, same risk factors, same catalyst pipeline. Not because we told the system they're similar. Because their filings say the same things.

Gold miners

Two gold mining companies with 92.3% classifier similarity. Same exploration language, same production disclosures, same risk factors. When one announces a resource estimate upgrade, the other is structurally positioned for the same catalyst.

Clinical-stage biotech

A group of 5 biotech companies that keep appearing as each other's twins. Same clinical trial language, same FDA interaction patterns, same capital structure disclosures. When one advances, the others have the same pipeline.

Traditional similarity measures use price correlation — which stocks moved together in the past. Classifier similarity measures which companies ARE together — same filings, same language, same situation. Price correlation breaks when markets change. Classifier similarity is structural — it changes only when the company's filings change.
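The "same classifiers fire, the same ones don't" definition of a structural twin maps naturally onto a simple matching score. The sketch below assumes that reading; the source does not say how the similarity percentage is computed, and both function names are hypothetical.

```python
def profile_agreement(a, b):
    """Fraction of classifiers on which two companies agree:
    both fire or both stay silent. Profiles are equal-length 0/1 lists."""
    if len(a) != len(b):
        raise ValueError("profiles must cover the same classifiers")
    if not a:
        return 0.0
    same = sum(1 for x, y in zip(a, b) if bool(x) == bool(y))
    return same / len(a)

def nearest_twin(target, profiles):
    """Return (name, similarity) of the closest profile in `profiles`,
    a dict mapping company name to its 0/1 classifier profile."""
    name, profile = max(
        profiles.items(), key=lambda kv: profile_agreement(target, kv[1])
    )
    return name, profile_agreement(target, profile)
```

One design caveat: when most classifiers are silent for most companies, shared silence inflates this score, so a measure that counts only shared firings (such as Jaccard similarity) may separate twins more sharply.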

Scale

The foundation

Embedding model

E5-base-v2, 768 dimensions. Every sentence is mapped to a point in semantic space where meaning, not vocabulary, determines proximity. "CEO was fired" and "CEO was hired" land in different neighborhoods. The previous 384-dimensional model couldn't separate them.

Quality gates

Every classifier must pass six gates, including training accuracy (AUC >= 0.85), concept separation (d' >= 2.0), fire rate control (< 1% for AI-discovered), and coherence (positives must cluster in embedding space). 62 classifiers were removed for failing coherence. Quality over quantity.
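Three of the stated thresholds can be checked directly from a classifier's scores on labeled sentences. The sketch below is an illustration under assumptions: AUC is computed by pairwise comparison, d' uses the common unequal-variance form (difference in means over the root of the average variance), and the coherence gate and any unlisted gates are left out.

```python
import math
from statistics import mean, pvariance

def auc(pos_scores, neg_scores):
    """Probability that a random positive outscores a random negative
    (ties count half). Equivalent to the area under the ROC curve."""
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos_scores
        for n in neg_scores
    )
    return wins / (len(pos_scores) * len(neg_scores))

def d_prime(pos_scores, neg_scores):
    """Sensitivity index: separation of the two score distributions.
    Assumption: d' = (mean_pos - mean_neg) / sqrt((var_pos + var_neg) / 2)."""
    spread = math.sqrt((pvariance(pos_scores) + pvariance(neg_scores)) / 2)
    gap = mean(pos_scores) - mean(neg_scores)
    if spread == 0:
        return float("inf") if gap > 0 else 0.0
    return gap / spread

def passes_gates(pos_scores, neg_scores, fire_rate,
                 min_auc=0.85, min_d=2.0, max_fire_rate=0.01):
    """Apply the three numeric thresholds quoted in the text.
    Note: the text states the fire-rate cap for AI-discovered classifiers only."""
    return (
        auc(pos_scores, neg_scores) >= min_auc
        and d_prime(pos_scores, neg_scores) >= min_d
        and fire_rate < max_fire_rate
    )
```

The pairwise AUC is quadratic in the number of scores, which is fine for a validation sample but would be replaced by a rank-based computation on larger sets.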

Continuous monitoring

Classifier profiles update with every new filing. A company's archetype can shift as its situation changes — a healthy biotech entering a funding crisis transitions from one archetype to another as its filing language shifts. The classification tracks reality in real time.