The Classifier Engine
Every signal is only as trustworthy as the sentence behind it.
Provenance Core applies trained NLP classifiers to every SEC filing at the sentence level — and links every output directly back to the regulatory language that generated it.
Most financial data products sever the source chain.
A score arrives. A flag fires. A sentiment reads positive. But the text that generated it — the exact sentence in the exact filing — is gone. For compliance teams, model governance frameworks, and institutional DDQ requirements, a signal without a source is not evidence. It is an assertion.
Four steps. Zero gaps.
Read
Every sentence in every SEC filing (10-K, 10-Q, and 8-K) is processed the moment it lands on EDGAR. No document is read partially, and no sentence is skipped.
Classify
Each sentence is evaluated against the full classifier library using 768-dimensional semantic vectors — not keywords. Two sentences with zero shared words will fire the same classifier if they describe the same business condition.
Score
Every classifier match receives a confidence score from a logistic regression model. Filing-level signal strength is the sum of all matched sentence scores, so it captures how prevalently a theme runs through the document, not just whether it appeared once.
Link
Every matched sentence is preserved verbatim and paired with a live URL to the original filing on SEC EDGAR. The chain from signal output back to primary source is never broken — for every classifier, every sentence, every filing.
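The Classify and Score steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Provenance Core's implementation: the weights, biases, the 0.5 firing threshold, and all function names are hypothetical, and 4-dimensional toy vectors stand in for the 768-dimensional sentence vectors.

```python
import math
from collections import defaultdict


def confidence(embedding, weights, bias):
    """Logistic regression score: sigmoid of the weighted sum plus bias."""
    z = sum(w * x for w, x in zip(weights, embedding)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def classify_sentence(embedding, classifiers, threshold=0.5):
    """Return {classifier_name: confidence} for each classifier that fires.

    The threshold is an assumed value; the source does not state one.
    """
    matches = {}
    for name, (weights, bias) in classifiers.items():
        score = confidence(embedding, weights, bias)
        if score >= threshold:
            matches[name] = score
    return matches


def filing_signal_strength(sentence_embeddings, classifiers, threshold=0.5):
    """Sum the matched sentence scores per classifier across one filing."""
    totals = defaultdict(float)
    for embedding in sentence_embeddings:
        matched = classify_sentence(embedding, classifiers, threshold)
        for name, score in matched.items():
            totals[name] += score
    return dict(totals)


# Hypothetical classifier library: name -> (weights, bias).
classifiers = {
    "Demand Accelerating": ([2.0, 0.0, 0.0, 0.0], -1.0),
    "Going Concern": ([0.0, 0.0, 0.0, 2.0], -1.0),
}

# Toy filing: two sentences on the same theme and one neutral sentence.
filing = [
    [1.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]

strength = filing_signal_strength(filing, classifiers)
```

Because the filing-level value is a sum rather than a maximum, the two matching sentences here contribute twice the score of one, which is how repeated language raises signal strength.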
Three outputs. Every filing.
Named Business Signals
Hundreds of named NLP classifiers, each detecting a specific business condition — not a general topic, a named signal. Demand Accelerating. Operating Margin Declining. Going Concern. Backlog Building. Every classifier has a name, a definition, and a trained model behind it.
Sentence-Level Evidence
Every matched sentence is preserved verbatim. No paraphrase. No summary. The exact regulatory language that triggered the signal — alongside the classifier name, confidence score, and its position within the filing.
Live Source Links
Every output carries a direct URL to the original SEC filing on EDGAR. Not a cached copy. Not a data vendor intermediary. The primary regulatory disclosure, accessible to any reviewer in seconds.
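One way to picture an output record, and how a reviewer might check it, is the sketch below. The field names, the use of a character offset for "position", the sample sentence, the confidence value, and the placeholder URL are all assumptions made for illustration; the actual record schema is not described here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SignalRecord:
    classifier: str    # named signal, e.g. "Backlog Building"
    sentence: str      # matched sentence, stored verbatim
    confidence: float  # classifier confidence score
    position: int      # assumed: character offset into the filing text
    source_url: str    # link to the original filing on SEC EDGAR


def is_verbatim(record, filing_text):
    """True if the stored sentence sits in the filing at the recorded offset."""
    end = record.position + len(record.sentence)
    return filing_text[record.position:end] == record.sentence


# Invented filing text and sentence, used only to exercise the check.
filing_text = "Item 7. Revenue grew. Backlog increased during the quarter."
sentence = "Backlog increased during the quarter."

record = SignalRecord(
    classifier="Backlog Building",
    sentence=sentence,
    confidence=0.91,  # illustrative value
    position=filing_text.index(sentence),
    source_url="https://www.sec.gov/Archives/edgar/data/<cik>/<accession>/<document>",  # placeholder
)

audit_ok = is_verbatim(record, filing_text)
```

A check like this fails as soon as the stored sentence is paraphrased or the filing text differs, which is the property that verbatim preservation is meant to guarantee.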
What it looks like in practice
Three tickers. Verbatim sentences from live filings. Every detail auditable.
"Consistent with our prior expectations, tendering activity and contract awards increased during the latter part of 2025."
Eight consecutive filings. Echo climbed from 2.61 to 4.76. Two years of unbroken order-visibility narrative.
Live link included in every output record →
"Gross margin for the third quarter of fiscal 2026 decreased 130 basis points to 40.2% primarily due to higher tariffs in North America."
19 consecutive filings — five years. Echo at its ceiling, reinforced faster than it decays. Structural, not cyclical.
Live link included in every output record →
"Revenue for the fourth quarter was a record $68.1 billion, up 73% from a year ago and up 20% sequentially."
Streak of 12 at peak echo. Three years of demand language in every quarterly disclosure. Invisible in any single filing alone.
Live link included in every output record →
Built for every financial professional who needs data they can explain
Quant Analysts
Echo and Streak become quantitative inputs — how long has a signal been building, how persistent, how rare? Every data point carries a live SEC link. When the risk team asks where it came from, the answer is one click away.
Fundamental Analysts
Instead of starting from a blank page, analysts start from what the company said in its own words — and how long it has been saying it. Provenance surfaces companies where signals are building, breaking, or converging, then shows the verbatim sentences.
Model Risk & Compliance
Every Provenance Core output carries the verbatim source sentence, confidence score, and a live link to the original SEC EDGAR filing. The audit trail is complete before reviewers ask for it. Provenance does not just pass review; it makes the review easy.
ML Engineers
Millions of labeled tuples — sentence, classifier, confidence score, live source URL. Fine-tuning on this data teaches models to cite every assertion, directly addressing hallucination at the training data layer.
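As a rough sketch of how the labeled tuples described for ML engineers could be turned into training data, the snippet below serializes (sentence, classifier, confidence, source URL) tuples to JSON Lines so that each example keeps its citation. The key names, the `min_confidence` filter, the sample tuples, and the placeholder URLs are hypothetical; no particular fine-tuning format is implied by the source.

```python
import json


def to_jsonl(tuples, min_confidence=0.0):
    """Serialize labeled tuples to JSON Lines, one example per line.

    Tuples below min_confidence (an assumed, optional filter) are dropped.
    """
    lines = []
    for sentence, classifier, confidence, source_url in tuples:
        if confidence < min_confidence:
            continue
        example = {
            "input": sentence,
            "label": classifier,
            "confidence": confidence,
            "citation": source_url,  # keeps the source attached to the label
        }
        lines.append(json.dumps(example))
    return "\n".join(lines)


# Invented tuples for illustration only.
samples = [
    ("Orders rose sharply in the quarter.", "Demand Accelerating", 0.93,
     "https://www.sec.gov/<placeholder-1>"),
    ("We expect results broadly in line.", "Demand Accelerating", 0.41,
     "https://www.sec.gov/<placeholder-2>"),
]

jsonl = to_jsonl(samples, min_confidence=0.5)
```

Keeping the citation as a field of every example, rather than in a separate lookup table, means a model trained on the data always sees an assertion and its source together.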
Provenance Core is the foundation.
Everything else in the Provenance platform is built on top of what Provenance Core produces. Provenance Stream extends classifier signals across six independent data channels. Provenance Reports packages classifier outputs into formatted analyst reports. Provenance Connect exposes the full platform via a native MCP server.
Every signal. Every source. Every time.
Sample datasets for NVDA, NKE, and RIG are available immediately in JSON and CSV formats.