The Standard Industrial Classification system was designed in 1937. Its replacement, GICS, was created by MSCI and S&P in 1999. Both classify companies by what they sell: semiconductors, pharmaceuticals, commercial banking.

We classified companies by what they say.

The Idea

We have 469 binary classifiers that run on every sentence of every SEC filing. Each one detects a specific concept: is_going_concern_warning, is_clinical_trial_result, is_mineral_exploration_delineation, is_convertible_debt_conversion. When a company files its 10-Q, these classifiers produce a profile — which ones fire and which ones don't.
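As a rough illustration of how sentence-level outputs could become a filing-level profile, here is a minimal sketch. It assumes each sentence has already been reduced to the set of classifier names that fired on it, and it aggregates by the fraction of sentences per classifier. Both the aggregation rule and the function name `filing_profile` are assumptions made for illustration, not a description of the production pipeline.

```python
from collections import defaultdict

def filing_profile(sentence_flags, classifier_names):
    """Return, for each classifier, the fraction of sentences it fired on."""
    counts = defaultdict(int)
    for fired in sentence_flags:
        for name in fired:
            counts[name] += 1
    n = len(sentence_flags)
    # Fixed classifier order, so every filing yields a vector of the same shape.
    return [counts[name] / n if n else 0.0 for name in classifier_names]

# Three of the 469 classifier names, taken from the text above.
CLASSIFIERS = [
    "is_going_concern_warning",
    "is_clinical_trial_result",
    "is_convertible_debt_conversion",
]

# Toy filing with four sentences; each set lists the classifiers that fired.
sentences = [
    {"is_clinical_trial_result"},
    {"is_clinical_trial_result", "is_going_concern_warning"},
    set(),
    {"is_going_concern_warning"},
]

profile = filing_profile(sentences, CLASSIFIERS)
print(dict(zip(CLASSIFIERS, profile)))
```

With all 469 classifiers in `classifier_names`, the same function would produce the 469-dimensional vector the rest of this piece refers to.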

A biotech startup's profile looks nothing like a regional bank's. Not because we told the system one is biotech and the other is banking — we didn't. Because the language they use in their filings is fundamentally different. The biotech talks about clinical trials, FDA designations, enrollment milestones, endpoint results. The bank talks about loan loss provisions, net interest margin, deposit growth, credit risk methodology.

The classifier profile is a fingerprint. And when we clustered 3,000 companies by that fingerprint — no labels, no sector codes, just the 469-dimensional classifier vector — natural groupings emerged that predict stock behavior better than any existing classification system.

What the Clusters Found

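We ran KMeans on the average classifier profile per company and looked at what grouped together. Ten clusters emerged. We scored each by its mover rate: the share of its companies that gained 50% or more within 21 trading days. The top three had mover rates above 18%. The bottom three were below 3%. The spread between the best and worst cluster (35.7% versus 0.9%) was roughly 40×.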

No industry classification produces that kind of separation.

21-day mover rate by cluster:
SPAC / IPO Shells: 35.7%
Distressed Micro-Caps: 26.2%
Governance Upheaval: 18.9%
Banks & Lenders: 0.9%
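The clustering and per-cluster mover rates can be sketched as follows. This is a toy, standard-library version of Lloyd's algorithm (the usual KMeans procedure) with a deterministic initialisation. The real run presumably used a library implementation on 469-dimensional vectors. The two-dimensional profiles and mover flags below are made-up numbers, chosen only to make the mechanics visible.

```python
import math

def kmeans(points, k, iters=50):
    """Lloyd's algorithm with deterministic farthest-point initialisation."""
    centroids = [points[0]]
    while len(centroids) < k:
        # Next seed: the point farthest from its nearest existing centroid.
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j])) for p in points
        ]
        # Update step: each centroid becomes the mean of its members.
        new_centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centroids.append(centroids[j])  # leave an empty cluster in place
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels, centroids

def mover_rate_by_cluster(labels, moved):
    """Share of companies in each cluster that were flagged as movers."""
    totals, movers = {}, {}
    for lab, m in zip(labels, moved):
        totals[lab] = totals.get(lab, 0) + 1
        movers[lab] = movers.get(lab, 0) + int(m)
    return {lab: movers[lab] / totals[lab] for lab in totals}

# Toy profiles: [warrant-language rate, loan-loss-language rate] (invented values).
profiles = [[0.9, 0.0], [0.8, 0.1], [0.85, 0.05],
            [0.0, 0.9], [0.1, 0.8], [0.05, 0.95]]
moved = [True, False, True, False, False, False]

labels, _ = kmeans(profiles, k=2)
rates = mover_rate_by_cluster(labels, moved)
print(labels, rates)

# Spread between the best and worst cluster, from the headline figures above.
spread = 35.7 / 0.9
print(round(spread, 1))
```

Note that no sector label enters the clustering step. The mover flags are attached only afterwards, to describe the clusters that came out.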

SPAC/IPO Shells — 35.7% mover rate

These companies are defined by incorporation disclosures, pro forma financial statements, multi-class voting structures, and warrant accounting. They're recently listed, often pre-revenue, with complex capital structures designed for de-SPAC transactions.

Every quant model knows what a SPAC is. But the classifier profile captures something subtler: the degree of "shell-ness." A company that's been public for two years but still fires heavily on pro forma disclaimers and warrant valuation is structurally different from one that's left its SPAC origins behind. The classifier profile doesn't care about the SPAC label; it measures how much the company still talks like one.

35.7% of these companies gained 50% or more within 21 trading days. The catalyst is usually a business combination closing, a de-SPAC announcement, or the first real revenue quarter.
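For concreteness, the "mover" label can be written down as a small function. This is a sketch under stated assumptions: the gain is measured from the close on a reference day (for example the filing date), and a company counts as a mover if any close in the following 21 trading days is at least 50% higher. Whether the original definition uses the peak or the day-21 close is not specified here, and the name `is_mover` is hypothetical.

```python
def is_mover(closes, start, horizon=21, threshold=0.50):
    """True if any close in the `horizon` trading days after `start`
    is at least `threshold` above the close on day `start`."""
    base = closes[start]
    window = closes[start + 1 : start + 1 + horizon]
    return any(price / base - 1.0 >= threshold for price in window)

# Toy daily closes (invented): a jump to 16.0 on the third trading day.
spiky = [10.0, 11.0, 12.0, 16.0, 9.0]
flat = [10.0] * 30
# The jump here lands on trading day 22, just outside the window.
too_late = [10.0] * 22 + [20.0]

print(is_mover(spiky, 0), is_mover(flat, 0), is_mover(too_late, 0))
```

A cluster's mover rate is then just the mean of this flag over the companies in the cluster.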

Distressed Micro-Caps — 26.2% mover rate

Convertible debt conversion language. Black-Scholes warrant valuation. Institutional investment disclosures. Debt discount amortization. These are companies with complex, distressed capital structures where debt is actively converting to equity, warrants are being exercised, and institutional investors are taking positions.

This isn't a sector. SIC would call some of these biotech, some technology, some mining. What they have in common is the capital structure: layered convertible instruments, active warrant accounting, and institutions circling. It's the financial signature of a company in transition — too distressed for most investors, but actively restructuring in ways that create binary outcomes.

Governance Upheaval — 18.9% mover rate

Committee discretionary authority clauses. Director continuity statements. Litigation defense assertions. This cluster captures companies in the middle of governance battles — board reshuffles, activist campaigns, litigation that's forcing structural change.

These companies are fighting internally. The fighting itself is the catalyst. Activist investors force asset sales. Board turnover brings new strategy. Litigation settlements restructure the balance sheet. The governance upheaval cluster has the highest standalone information coefficient (IC) of any archetype feature we tested.

Banks and Lenders — 0.9% mover rate

Allowance for credit losses methodology. Quarterly loan-to-deposit growth comparisons. Net interest margin decline disclosures. Lease right-of-use incremental borrowing rate.

Three companies out of 1,300 in this cluster gained 50% in 21 days. Three.

Banks don't bounce. Not because they can't be distressed — they can and often are. But because bank recovery is a slow, quarter-by-quarter grind tied to rate cycles, credit quality, and regulatory capital ratios. There's no binary catalyst. No Phase 3 readout. No de-SPAC announcement. No activist forcing a sale. The filing language tells you this: it's all incremental disclosures about accounting methodology. Nothing in a bank's 10-Q says "something dramatic is about to happen."

What SIC Gets Wrong

Standard Industrial Classification codes describe what a company produces. SIC 6022 is "State commercial banks." SIC 2836 is "Biological products, except diagnostic substances."

This tells you something, but it misses the most important dimension: what kind of situation is this company in?

A biotech that just received a Complete Response Letter from the FDA and is running out of cash is in a completely different situation from a profitable biotech expanding into new indications. SIC calls them both 2836. Our classifier profile separates them — one fires on is_crl_received and is_capital_raising_uncertainty, the other fires on is_indication_expansion and is_revenue_growing. The first has a 26% chance of a massive move. The second has a 5% chance.

SIC also fails at boundaries. Is a company that develops AI software for drug discovery a tech company (SIC 7372) or a pharmaceutical company (SIC 2836)? The SEC assigns one code. Our classifier profile doesn't choose — it reflects both: the company fires on is_clinical_trial_milestone AND is_subscriber_count_disclosure. It's both things at once, and the profile captures that naturally.

The most revealing case: SIC 6199 ("Finance Services, Not Elsewhere Classified"). This is the catch-all. It includes advisory firms, fintech startups, crypto companies, and insurance technology platforms. They share a SIC code. They share nothing else. Our clustering separates them cleanly: the advisory firms land in the bank/lender cluster (0.9% mover rate), while the fintech startups land in the distressed micro-cap cluster (26.2%).

The Classification Nobody Would Design

Some of our clusters don't correspond to any industry. "Companies with convertible debt, warrant accounting, and institutional positioning" isn't a sector. You won't find it in any taxonomy. But it's a real category with distinctive filing language and distinctive stock behavior.

This happens because our classifiers detect situations, not products. is_warrant_valued_black_scholes doesn't care if the company makes software or mines copper. It cares that the company has warrants complex enough to require Black-Scholes disclosure. A software company and a mining company that both fire this classifier are more similar — in terms of what will happen to their stock — than two software companies where one does and one doesn't.
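The claim that two companies in different industries can sit closer together than two in the same industry can be made concrete with a similarity measure over profiles. The sketch below uses cosine similarity on tiny binary vectors. The choice of cosine similarity and the three example companies are illustrative assumptions, and the vectors are invented rather than taken from real filings.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Column order (classifier names appear elsewhere in this piece):
#   is_warrant_valued_black_scholes, is_convertible_debt_conversion,
#   is_subscriber_count_disclosure, is_mineral_exploration_delineation
software_distressed = [1, 1, 1, 0]
miner_distressed = [1, 1, 0, 1]
software_healthy = [0, 0, 1, 0]

cross_industry = cosine(software_distressed, miner_distressed)
same_industry = cosine(software_distressed, software_healthy)
print(round(cross_industry, 3), round(same_industry, 3))
```

In this toy setup the distressed software company is nearer to the distressed miner than to the healthy software company, because the shared capital-structure classifiers outweigh the single shared product classifier.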

The traditional view: classify by product, then study behavior within sectors.
Our view: classify by behavior directly. The filing language IS the behavior.

A Dynamic Taxonomy

SIC codes rarely change. Once a company is assigned a commercial bank code, it stays a commercial bank until the code is formally reassigned. GICS reviews annually but changes are rare.

Our classifier profiles update every quarter with each new filing. A company can move between archetypes as its situation changes. A healthy biotech that runs into a funding crisis transitions from the biotech cluster to the distressed micro-cap cluster as its filing language shifts — more warrant accounting, more going concern language, more convertible debt disclosures. The classification tracks the reality.

This dynamic reclassification is exactly what makes the system useful for prediction. A company that just entered the "governance upheaval" cluster is more interesting than one that's been there for three years. The transition itself is a signal.
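One simple way to turn "the transition itself is a signal" into a feature is to count how many consecutive quarters a company has spent in its current cluster. The sketch below assumes a per-company list of quarterly cluster labels, oldest first. The function name and the label strings are hypothetical.

```python
def quarters_in_current_cluster(history):
    """Length of the trailing run of identical labels (oldest-first history)."""
    if not history:
        return 0
    current = history[-1]
    run = 1
    # Walk backwards from the second-newest quarter until the label changes.
    for previous in reversed(history[:-1]):
        if previous != current:
            break
        run += 1
    return run

# Invented histories for illustration.
new_entrant = ["biotech", "biotech", "governance_upheaval"]
long_timer = ["governance_upheaval"] * 12

print(quarters_in_current_cluster(new_entrant))
print(quarters_in_current_cluster(long_timer))
```

A value of 1 marks a company that has just crossed into a cluster, which is the case the paragraph above singles out as more interesting than a long-standing member.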

469 Dimensions of Identity

Every company that files with the SEC writes thousands of sentences every quarter. Those sentences pass through 469 classifiers. The resulting vector is a portrait — not of what the company sells, but of what the company is experiencing.

Clinical trials and FDA interactions. Debt restructuring and covenant compliance. Board fights and activist campaigns. Mineral exploration and production reports. Revenue growth and margin expansion. Each classifier captures one axis of corporate life.

When we cluster on all 469 axes simultaneously, the groups that emerge are richer than any human-designed taxonomy because they capture combinations no human would think to define. "SPAC shell with active warrant conversion and institutional positioning" is a type. "Biotech with governance upheaval and distressed capital structure" is a type. These types predict outcomes because they reflect the actual forces acting on the company — not the label someone assigned based on its primary revenue source.

The Practical Impact

We built a model that predicts which stocks will gain 50% or more in 21 trading days. For months, it treated all distressed companies equally. The classifier profile features — the archetype scores — were the single biggest improvement to that model this quarter.

Not because they added new data. They reorganized data we already had. The clinical trial classifiers were already in the model. The going concern classifiers were already there. What wasn't there was the knowledge that "clinical trial classifiers + going concern classifiers = distressed biotech = high bounce probability" while "loan loss provision classifiers + going concern classifiers = distressed bank = very low bounce probability."
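The point about combinations can be sketched as an interaction feature: a score that is high only when every classifier in a group fires. The version below takes the minimum firing rate across the group. This is one possible construction, not necessarily how the archetype scores were actually computed. The name `is_loan_loss_provision` and all the rates are made up for illustration.

```python
def archetype_scores(profile, archetypes):
    """Score each archetype by its weakest member classifier, so the
    score is non-zero only when all members fire."""
    return {
        name: min(profile.get(classifier, 0.0) for classifier in members)
        for name, members in archetypes.items()
    }

# Hypothetical archetype definitions built from the pairings described above.
ARCHETYPES = {
    "distressed_biotech": ["is_clinical_trial_result", "is_going_concern_warning"],
    "distressed_bank": ["is_loan_loss_provision", "is_going_concern_warning"],
}

# Invented firing rates; both companies share the going concern signal.
biotech = {"is_clinical_trial_result": 0.3, "is_going_concern_warning": 0.2}
bank = {"is_loan_loss_provision": 0.4, "is_going_concern_warning": 0.2}

biotech_scores = archetype_scores(biotech, ARCHETYPES)
bank_scores = archetype_scores(bank, ARCHETYPES)
print(biotech_scores)
print(bank_scores)
```

Both companies have the same going concern rate, yet the scores separate them, which is the kind of distinction the individual classifier features did not hand to the model directly.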

The model needed to know what kind of company it was looking at. 469 classifiers on 150 million sentences told it.