Classifier Quality

Every classifier must pass three independent quality gates before production deployment. Gate 1, Separability: cross-validated precision on held-out data (THP@20 ≥ 75%, d′ ≥ 2.5). Gate 2, Fire rate: the classifier must fire on less than 5% of real sentences (less than 1% for AI-discovered classifiers). Gate 3, Coherence: positive sentences must cluster tightly in embedding space (silhouette ≥ 0.20). Of 977 classifiers built, 529 passed all three gates. AUC alone is not sufficient: 62 classifiers with strong AUC failed coherence and were removed, improving downstream IC by 48%.
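As a rough illustration of how the three gates combine, the sketch below encodes the thresholds quoted above as a single pass/fail check. The `GateMetrics` container, its field names, and the `d_prime` helper (which uses the common equal-weight pooled-variance form of d′) are assumptions made for this example; the production pipeline may compute these quantities differently.

```python
from dataclasses import dataclass
from math import sqrt
from statistics import mean, pvariance


@dataclass
class GateMetrics:
    """Hypothetical container for one classifier's quality metrics."""
    thp_at_20: float        # THP@20 as a fraction (0.75 means 75%)
    d_prime: float          # separation between positive and negative scores
    fire_rate: float        # share of real sentences that fire, as a fraction
    silhouette: float       # cohesion of positive sentences in embedding space
    ai_discovered: bool = False


def d_prime(pos_scores, neg_scores):
    """Sensitivity index: gap between means over the pooled standard deviation."""
    pooled = sqrt((pvariance(pos_scores) + pvariance(neg_scores)) / 2)
    return (mean(pos_scores) - mean(neg_scores)) / pooled


def passes_gates(m):
    """Return True only if all three gates pass."""
    max_fire = 0.01 if m.ai_discovered else 0.05   # stricter cap for AI-discovered
    separability = m.thp_at_20 >= 0.75 and m.d_prime >= 2.5
    fire = m.fire_rate < max_fire
    coherence = m.silhouette >= 0.20
    return separability and fire and coherence
```

Because the gates are combined with a logical AND, a classifier with excellent separability still fails if it fires too broadly or its positives do not cluster.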

AUC measures how well a classifier separates its training examples. It says nothing about what happens on 151 million real sentences. A classifier trained to detect layoff announcements — AUC 0.98 — scored “hired 500 employees to support expansion” with high confidence. It learned workforce vocabulary, not the layoff concept. Fire rate catches over-broad classifiers; coherence catches concept confusion. Each gate targets a different failure mode that AUC misses.
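To make the fire-rate idea concrete, here is a minimal sketch that measures the share of sentences a classifier fires on. The 0.5 decision threshold, the toy sentences (other than the hiring example quoted above), and all score values are invented for illustration only.

```python
def fire_rate(scores, threshold=0.5):
    """Fraction of sentences whose score meets or exceeds the threshold."""
    if not scores:
        raise ValueError("scores must be non-empty")
    fired = sum(1 for s in scores if s >= threshold)
    return fired / len(scores)


# Toy corpus: made-up scores that a layoff classifier might assign.
corpus_scores = {
    "we reduced headcount by 12% in the third quarter": 0.97,
    "hired 500 employees to support expansion": 0.91,  # workforce vocabulary, wrong concept
    "revenue increased due to higher volumes": 0.03,
    "the board declared a quarterly dividend": 0.02,
}

rate = fire_rate(list(corpus_scores.values()))
print(f"fire rate: {rate:.0%}")
```

On a real corpus the same calculation would run over every scored sentence, and a rate at or above the 5% cap would fail Gate 2 regardless of the classifier's AUC.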

Two methods. 159 classifiers are human-curated from investment theses, covering concepts like “going concern warning” and “clinical trial milestone.” 313 classifiers were discovered by unsupervised clustering: we embedded 151M SEC sentences into 768-dimensional space, ran HDBSCAN, and found 551 natural groupings. Each cluster became a classifier candidate. 370 passed quality gates; 57 turned out to be boilerplate patterns and were excluded. 4 of the top 5 most predictive classifiers in the system were machine-discovered, not human-designed.

We timestamp every score with the EDGAR filing date, not the period end date. A 2-day embargo is applied post-filing. Models are trained with walk-forward methodology, so no future data leaks into any historical score. Classifier weights are frozen at deployment; backtest scores are identical to production scores.
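The point-in-time rules above can be sketched as follows. This is a minimal illustration under stated assumptions: the embargo is counted in calendar days, a score is treated as usable from filing date plus embargo, and the record layout (`id`, `filing_date`) and cutoff dates are hypothetical.

```python
from datetime import date, timedelta

EMBARGO_DAYS = 2  # from the text; counting calendar days is an assumption


def available_on(filing_date):
    """First date a score may be used: filing date plus the embargo."""
    return filing_date + timedelta(days=EMBARGO_DAYS)


def walk_forward_splits(records, cutoffs):
    """Yield (cutoff, train, test) where train is strictly earlier than test."""
    cutoffs = sorted(cutoffs)
    for i, cutoff in enumerate(cutoffs):
        upper = cutoffs[i + 1] if i + 1 < len(cutoffs) else date.max
        # Train only on records that were already available before the cutoff.
        train = [r for r in records if available_on(r["filing_date"]) < cutoff]
        # Test on records that became available in [cutoff, next cutoff).
        test = [r for r in records
                if cutoff <= available_on(r["filing_date"]) < upper]
        yield cutoff, train, test


# Hypothetical filings, keyed by EDGAR filing date rather than period end.
filings = [
    {"id": "a", "filing_date": date(2020, 1, 10)},
    {"id": "b", "filing_date": date(2020, 6, 29)},
    {"id": "c", "filing_date": date(2020, 6, 30)},
    {"id": "d", "filing_date": date(2021, 3, 1)},
]

for cutoff, train, test in walk_forward_splits(
        filings, [date(2020, 7, 1), date(2021, 1, 1)]):
    print(cutoff, [r["id"] for r in train], [r["id"] for r in test])
```

Filing "b" illustrates the embargo: it was filed before the first cutoff but only becomes available on the cutoff date, so it lands in the test window rather than the training set.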

Signal Explorer & Evidence

Yes. Every signal in the explorer is backed by sentence-level evidence. Click any signal pill and a panel shows the exact sentences from the filing that triggered the classifier, along with confidence scores. No black boxes — every signal is provable and auditable. For AI-native access, Provenance Connect exposes the same signals to Claude and other MCP-compatible assistants.

Every company’s filing history creates a unique 469-dimensional “fingerprint” based on which classifiers fire. Clustering these profiles reveals 10 natural company archetypes — SPAC shells, distressed micro-caps, clinical-stage biotech, banks, and more. The spread between the best and worst archetype is 40×. This is a new kind of company classification built from language, not industry codes.
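A fingerprint of this kind can be pictured as a vector with one slot per classifier. The sketch below assumes a binary encoding (fired or not) and uses Jaccard similarity to compare two companies; both choices, the three-entry catalog, and the company profiles are illustrative assumptions, and the actual archetype clustering method is not described here.

```python
def fingerprint(fired, catalog):
    """Binary vector with one slot per classifier, in fixed catalog order."""
    fired = set(fired)
    return [1 if name in fired else 0 for name in catalog]


def jaccard(a, b):
    """Shared fired classifiers divided by classifiers fired in either profile."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return both / either if either else 0.0


# Hypothetical miniature catalog; the real one has hundreds of entries.
catalog = ["going_concern_warning", "clinical_trial_milestone", "layoff_announcement"]

biotech = fingerprint({"going_concern_warning", "clinical_trial_milestone"}, catalog)
distressed = fingerprint({"going_concern_warning"}, catalog)

print(biotech, distressed, jaccard(biotech, distressed))
```

Profiles with high pairwise similarity would tend to fall into the same archetype under most clustering schemes.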

Generic sentiment outputs a single continuous score per document. Companies copy-paste 80%+ of their filing text quarter to quarter — sentiment barely moves. Our classifiers detect 472 specific binary events with clear economic meaning: a going concern warning either fired or it didn’t. Each signal is sentence-verifiable, directly countable, and model-ready without thresholding.

Yes. Scores are immutably versioned. When a new model version ships, it creates a new column; the old scores are preserved in perpetuity. Backtests remain reproducible across model updates — your backtest today is your backtest next year.
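One way to picture immutable versioning is an append-only store in which each model version gets its own column and existing columns can never be overwritten. The `VersionedScores` class, its method names, and the sample keys below are hypothetical and only illustrate the idea.

```python
class VersionedScores:
    """Append-only score store: one column per model version."""

    def __init__(self):
        self._columns = {}

    def publish(self, version, scores):
        """Add a new version column; refuse to overwrite an existing one."""
        if version in self._columns:
            raise ValueError(f"version {version!r} already published")
        self._columns[version] = dict(scores)  # copy so callers cannot mutate it

    def get(self, version):
        """Return a copy of the scores for one version."""
        return dict(self._columns[version])

    def versions(self):
        return list(self._columns)


store = VersionedScores()
store.publish("v1", {"filing-001": 0.91, "filing-002": 0.04})
store.publish("v2", {"filing-001": 0.88, "filing-002": 0.07})

# A backtest pinned to "v1" still sees the original values after "v2" ships.
print(store.get("v1"))
```

Pinning a backtest to a version name is what keeps its results reproducible after later model updates.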

Data & Delivery

Data is delivered in DuckDB or Parquet format via S3, API, or SFTP. Live latency is 4–6 hours after EDGAR filing. A free sample is available upon request. Full history runs back to 2014.

Coverage includes 10-K (annual), 10-Q (quarterly), and 8-K (current report) filings across 10,000+ companies. Ownership forms (3, 4, 5, 13F, SC 13G/D) are excluded because the classifiers are not economically meaningful for those document types.

Yes. Custom classifiers are scoped, trained, and validated through the same three-gate quality system as the core catalog. Alternatively, our cluster discovery pipeline can find novel concepts in your target domain automatically. Contact us to discuss your specific signal requirements.

A multi-year dataset covering the full classifier catalog across all tickers and filing types — enough for initial IC measurement and model integration testing. Full definitions, quality metrics, and classifier documentation are included.
