We embedded 150 million sentences from SEC filings into a 768-dimensional vector space and asked a simple question: what natural groupings exist in this language?

No hypotheses. No labels. Just structure.

HDBSCAN found 551 clusters. Some were obvious — biotech clinical trial language, executive compensation boilerplate, tax treatment discussions. But one cluster stopped us cold.
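
For the curious, the shape of that step is simple, even if the scale isn't. Here's a minimal sketch, assuming the sentence-transformers and hdbscan libraries; the min_cluster_size is an illustrative choice, and a 150-million-sentence corpus needs batching and, realistically, dimensionality reduction before HDBSCAN is tractable:

```python
# Minimal sketch of the embed-then-cluster step. The model is the
# E5-base-v2 checkpoint discussed later in this post; min_cluster_size
# is illustrative, not a tuned value.
from sentence_transformers import SentenceTransformer
import hdbscan

sentences = [
    "There were no impairment charges recorded in 2022.",
    "The Company did not pay dividends during the period.",
    # ... millions more in the real pipeline
]

model = SentenceTransformer("intfloat/e5-base-v2")
# E5 models expect a task prefix on each input.
embeddings = model.encode(
    ["passage: " + s for s in sentences],
    normalize_embeddings=True,  # unit vectors: euclidean distance tracks cosine
)

clusterer = hdbscan.HDBSCAN(min_cluster_size=50, metric="euclidean")
labels = clusterer.fit_predict(embeddings)  # -1 = noise, otherwise a cluster id
```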

Cluster 203: The Silence Cluster

Cluster 203 contained 3,282 sentences. Here's a sample:

"There were no significant changes in our critical accounting policies since the end of fiscal 2014."
"No such impairment charges were recorded in 2022."
"There were no transfers into or out of Level 1, Level 2, or Level 3 during the three months ended March 31, 2023."
"There were no triggering events in the third quarter of 2010."
"There were no events of default for the 2025 Notes or 2029 Notes at December 31, 2022."
"During the six months ended June 30, 2022, the Company did not pay dividends on its shares of Common Stock."

Three thousand sentences about nothing happening.

No impairment. No changes. No defaults. No dividends. No triggering events. No transfers. Nothing.

And the embedding model, trained on hundreds of millions of text pairs to capture semantic similarity, decided these sentences belong together. Not because they share topic keywords ("impairment," "dividends," and "triggering events" are entirely different subjects), but because they share structure: the explicit denial that something noteworthy occurred.
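
You can test this claim in miniature. Here's a sketch using the same model; the example sentences are ours, exact scores will vary, and the corpus-scale clustering is the real evidence, but if the cluster is real, the two denials should score closer to each other than either does to the affirmation:

```python
# Toy version of the claim above: two denials about unrelated topics
# versus a denial and an affirmation of the same topic. The sentences
# here are illustrative, not drawn from the cluster.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("intfloat/e5-base-v2")

denial_a = "passage: No impairment charges were recorded in 2022."
denial_b = "passage: The Company did not pay dividends during the period."
event_a  = "passage: We recorded a $1.2 billion impairment charge in 2022."

emb = model.encode([denial_a, denial_b, event_a], normalize_embeddings=True)

print("denial vs denial:", util.cos_sim(emb[0], emb[1]).item())
print("denial vs event: ", util.cos_sim(emb[0], emb[2]).item())
```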

Why This Matters

SEC filings are legal documents. Every sentence is deliberate. When a company writes "there were no impairment charges," they're not making small talk. They're responding to an implicit question that the regulatory framework demands they answer.

The interesting part isn't that these sentences exist. It's that they cluster together in embedding space — which means the model learned that "denial of event X" and "denial of event Y" are semantically closer to each other than either is to a description of event X or event Y actually happening.

The model learned the concept of negation as disclosure.

The Classifier Nobody Would Build

We have 411 hand-built classifiers for SEC filings. Things like is_going_concern_warning, is_restructuring_initiated, is_material_weakness_disclosed. Human hypotheses about what matters.

Nobody built is_nothing_happened. Why would you? "Nothing happened" isn't a thesis.

But consider: a company that suddenly stops saying "there were no impairment charges" might be about to disclose impairment charges. A company whose filings are dense with Cluster 203 language is a company going out of its way to tell you everything is fine.

The absence of absence is presence.
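
If we did build it, the signal might look something like this hypothetical sketch. Nothing here comes from our production system; the function name, the per-filing counts, and the window are all illustrative:

```python
# Hypothetical sketch of the "absence of absence" signal: flag a filing
# whose silence-cluster sentence count drops to zero after a run of
# filings that consistently contained such language. Threshold and
# window are illustrative, not tuned values.
from typing import Sequence

def nothing_happened_dropout(counts: Sequence[int], window: int = 4) -> bool:
    """True if the latest filing has zero silence-cluster sentences
    after the trailing `window` filings all had at least one."""
    if len(counts) < window + 1:
        return False
    history, latest = counts[-(window + 1):-1], counts[-1]
    return all(c > 0 for c in history) and latest == 0

# Quarterly Cluster 203 sentence counts for one issuer (made up):
print(nothing_happened_dropout([5, 6, 4, 5, 0]))  # True: the denial vanished
```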

550 Other Concepts

Cluster 203 was one of 551 natural groupings we found. Some confirmed existing classifiers — we found clusters that map cleanly to concepts we already built, like litigation language and executive departures. That's validation.

But many were new. Concepts no human defined:

  • Delisting warnings — Nasdaq non-compliance notices, minimum bid price violations
  • Troubled debt restructurings — TDRs, non-performing loans, non-accrual status
  • Covenant dividend restrictions — lenders blocking companies from paying dividends
  • Convertible debt conversions — debt-to-equity swap events
  • Material weakness disclosures — internal control failures

In all, twenty-four specific, event-like concepts, including the five above, emerged purely from the geometry of language. No human hypothesis required.

The Pipeline Inversion

The traditional approach to building NLP classifiers for financial text:

  1. Human thinks of a concept ("going concern warnings might predict stock drops")
  2. Label training data
  3. Train classifier
  4. Test if it works
  5. Hope it does

This is hypothesis-first science. It works when humans have good intuition about what matters.

The problem: human intuition about what predicts stock movements is demonstrably poor. We ran a coherence audit on our 411 classifiers and found that the ones with the tightest semantic clustering — the ones most clearly capturing a real concept — were is_same_store_sales_declining and is_lateral_length_extension. Not "AI disruption." Not "paradigm shift." Same-store sales and lateral well extensions.

The boring stuff. The operational minutiae buried on page 47 of a 10-Q. The stuff no human would pick as their top hypothesis for what moves stocks.

So we inverted the pipeline:

  1. Embed all the sentences
  2. Let clustering find the natural structure
  3. Each cluster is a candidate concept
  4. Score them all
  5. Let the model decide what matters

No human gatekeeping. No "does this make economic sense?" filtering. The cluster exists in embedding space, it's semantically coherent, and either it predicts outcomes or it doesn't. XGBoost doesn't care if the concept is elegant.
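
Concretely, the scoring step can be as unglamorous as the sketch below. The feature construction and the target are illustrative stand-ins, not our production setup; the point is the shape of the computation:

```python
# Sketch of "score them all": each cluster's sentence share in a filing
# becomes a feature, and a gradient-boosted model ranks what predicts
# the outcome. Data here is synthetic for illustration.
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
n_filings, n_clusters = 10_000, 551

# X[i, k] = share of filing i's sentences that landed in cluster k
X = rng.dirichlet(np.ones(n_clusters), size=n_filings)
y = rng.integers(0, 2, size=n_filings)  # e.g. "stock dropped after filing"

model = xgb.XGBClassifier(n_estimators=200, max_depth=4, eval_metric="logloss")
model.fit(X, y)

# Rank clusters by importance; high-gain clusters are candidate concepts
top = np.argsort(model.feature_importances_)[::-1][:10]
print("highest-gain clusters:", top)
```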

What 768 Dimensions Can See

The reason this works now — and didn't work two years ago — is the embedding model.

Our first attempt used MiniLM, a 384-dimensional model. Everything mashed together. Clusters were meaningless vocabulary neighborhoods. "CEO was fired" and "CEO was hired" landed in the same region because both sentences contain "CEO."

E5-base-v2 operates in 768 dimensions with contrastive training. It learned that "fired" and "hired" are semantically opposite, not similar. The extra dimensions aren't noise — they're the structure that makes clustering meaningful.

In 384 dimensions, the space is too compressed for semantic nuance. In 768 dimensions, there's room for "nothing happened" to have its own neighborhood, distinct from "something happened" and "something might happen."
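
The fired/hired example is easy to check yourself. Here's a sketch, assuming the standard Hugging Face checkpoints for both models; exact numbers will vary, but the question is whether the 768-dimensional model scores the pair as less similar:

```python
# Side-by-side similarity check for the fired/hired pair across both
# models discussed above. E5 expects a task prefix; MiniLM does not.
from sentence_transformers import SentenceTransformer, util

pair = ["The CEO was fired.", "The CEO was hired."]

for name, prefix in [("all-MiniLM-L6-v2", ""), ("intfloat/e5-base-v2", "query: ")]:
    model = SentenceTransformer(name)
    emb = model.encode([prefix + s for s in pair], normalize_embeddings=True)
    print(f"{name}: cos_sim = {util.cos_sim(emb[0], emb[1]).item():.3f}")
```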

The Takeaway

When we let the data speak, it said something we didn't expect: one of the most natural groupings in SEC language isn't about what companies disclose. It's about what they explicitly don't.

Three thousand sentences, from hundreds of companies, across decades of filings, all saying the same thing in different words: nothing to see here.

The embedding model heard them. And it put them together.