Question 1

What is the three-gate quality system?

Accepted Answer

Every classifier must pass three independent quality gates before production deployment. Gate 1 — Separability: cross-validated precision on held-out data (THP@20 >= 75%). Gate 2 — Fire rate: must fire on less than 5% of real sentences. Gate 3 — Coherence: positive sentences must cluster tightly in embedding space (silhouette >= 0.20). Of 977 classifiers built, 529 passed all three gates.
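The three thresholds above amount to a simple conjunction. Here is a minimal sketch in Python, assuming each metric has already been computed elsewhere and is expressed as a fraction. The names `GateMetrics` and `passes_all_gates` are hypothetical and are not part of the product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateMetrics:
    """Hypothetical container for one classifier's gate metrics."""
    thp_at_20: float   # Gate 1: cross-validated precision (THP@20), as a fraction
    fire_rate: float   # Gate 2: fraction of real sentences the classifier fires on
    silhouette: float  # Gate 3: silhouette score of the positive sentences


def passes_all_gates(m: GateMetrics) -> bool:
    """Return True only if all three gates pass (thresholds from the answer above)."""
    separable = m.thp_at_20 >= 0.75   # Gate 1: THP@20 >= 75%
    selective = m.fire_rate < 0.05    # Gate 2: fires on less than 5% of sentences
    coherent = m.silhouette >= 0.20   # Gate 3: silhouette >= 0.20
    return separable and selective and coherent
```

Because the gates are independent, a classifier with excellent separability still fails if it fires too often or its positives do not cluster.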

Question 2

Why isn't AUC enough?

Accepted Answer

AUC measures how well a classifier separates its training examples — it says nothing about what happens on 151 million real sentences. A classifier trained to detect layoff announcements (AUC 0.98) scored 'hired 500 employees to support expansion' with high confidence. Fire rate catches over-broad classifiers; coherence catches concept confusion.
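To make the fire-rate idea concrete, here is a minimal sketch that computes it from per-sentence scores. It assumes a classifier "fires" when its score meets a decision threshold. The 0.5 default and the name `fire_rate` are illustrative assumptions, not the production definition.

```python
from typing import Iterable


def fire_rate(scores: Iterable[float], threshold: float = 0.5) -> float:
    """Fraction of sentences whose score is at or above the threshold.

    The threshold value is an assumption made for illustration only.
    """
    scores = list(scores)
    if not scores:
        raise ValueError("fire rate is undefined for an empty set of sentences")
    fired = sum(1 for s in scores if s >= threshold)
    return fired / len(scores)
```

A classifier can have a high AUC on its training examples and still return a large value here when run over real filings, which is the over-broad behaviour the second gate is meant to catch.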

Question 3

How are classifiers discovered?

Accepted Answer

Two methods. 159 classifiers were curated by hand from investment theses. 313 classifiers were discovered by unsupervised clustering: we embedded 151M SEC sentences into a 768-dimensional space, ran HDBSCAN, and found 551 natural groupings. Four of the top five most predictive classifiers in the system were machine-discovered, not human-designed.

Question 4

How is look-ahead bias prevented?

Accepted Answer

We use EDGAR filing date stamps (not period end dates). A 2-day embargo is applied post-filing. Models are trained with a walk-forward methodology, so no future data leaks into any historical score. Classifier weights are frozen at deployment; backtest scores are identical to production scores.
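The embargo rule can be sketched as a date calculation keyed on the filing date. This is an illustration only: it assumes the 2 days are calendar days and that a signal becomes usable on the day the embargo ends. The function names are hypothetical.

```python
from datetime import date, timedelta

EMBARGO_DAYS = 2  # from the answer above; treating these as calendar days is an assumption


def earliest_usable_date(filing_date: date) -> date:
    """First date on which a signal from this filing may enter a backtest."""
    return filing_date + timedelta(days=EMBARGO_DAYS)


def is_usable(filing_date: date, as_of: date) -> bool:
    """True if the signal is past its embargo on the given as-of date."""
    return as_of >= earliest_usable_date(filing_date)
```

Keying on the EDGAR filing date rather than the period end date matters because a filing for a quarter ending on March 31 may not be public until weeks later.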

Question 5

Can I see the actual sentences behind a signal?

Accepted Answer

Yes. Every signal in the explorer is backed by sentence-level evidence. Click any signal pill and a panel shows the exact sentences from the filing that triggered the classifier, along with confidence scores. No black boxes — every signal is provable and auditable.

Question 6

What is a company classifier profile?

Accepted Answer

Every company's filing history creates a unique fingerprint based on which classifiers fire. Clustering these profiles reveals 10 natural company archetypes — SPAC shells, distressed micro-caps, clinical-stage biotech, banks, and more. The spread between the best and worst archetype is 40x.

Question 7

Are historical scores immutably frozen?

Accepted Answer

Yes. Scores are immutably versioned. When a new model version ships, it creates a new column; the old scores are preserved in perpetuity. Backtests remain reproducible across model updates — your backtest today is your backtest next year.
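The "new column per model version" behaviour can be pictured as an append-only store. The sketch below is a toy in-memory illustration, not the actual storage layer. The class name `VersionedScores` and the version labels are hypothetical.

```python
from typing import Dict, Iterable, Tuple


class VersionedScores:
    """Append-only score columns: one column per model version, never overwritten."""

    def __init__(self) -> None:
        self._columns: Dict[str, Tuple[float, ...]] = {}

    def add_version(self, version: str, scores: Iterable[float]) -> None:
        # Refuse to overwrite: an existing version's scores stay frozen.
        if version in self._columns:
            raise ValueError(f"scores for version {version!r} already exist")
        self._columns[version] = tuple(scores)  # tuple, so values cannot be mutated

    def get(self, version: str) -> Tuple[float, ...]:
        return self._columns[version]
```

A backtest pinned to one version label keeps reading the same values after later versions are added, which is what makes results reproducible across model updates.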

Question 8

How is data delivered?

Accepted Answer

DuckDB or Parquet format, delivered via S3, API, or SFTP. Live latency is 4–6 hours after EDGAR filing. A free sample is available upon request. Full history runs back to 2014.

Question 9

What filing types are covered?

Accepted Answer

10-K (annual), 10-Q (quarterly), and 8-K (current report) filings across 10,000+ companies. Ownership forms (3, 4, 5, 13F, SC 13G/D) are excluded because the classifiers are not economically meaningful for those document types.

Question 10

Can we get custom classifiers built for our strategy?

Accepted Answer

Yes. Custom classifiers are scoped, trained, and validated through the same three-gate quality system as the core catalog. Our cluster discovery pipeline can also find novel concepts in your target domain automatically.

Question 11

What does the free sample include?

Accepted Answer

A multi-year dataset covering the full classifier catalog across all tickers and filing types — enough for initial IC measurement and model integration testing. Full definitions, quality metrics, and classifier documentation are included.

Frequently Asked Questions
