Provenance Core

The Classifier Engine

Every signal is only as trustworthy as the sentence behind it.

Provenance Core applies trained NLP classifiers to every SEC filing at the sentence level — and links every output directly back to the regulatory language that generated it.

Most financial data products sever the source chain.

A score arrives. A flag fires. A sentiment reads positive. But the text that generated it — the exact sentence in the exact filing — is gone. For compliance teams, model governance frameworks, and institutional DDQ requirements, a signal without a source is not evidence. It is an assertion.

Four steps. Zero gaps.

Step One

Read

Every sentence in every SEC filing (10-K, 10-Q, and 8-K) is processed as soon as the filing lands on EDGAR. No document is read partially.

Continuous processing. No delay between publication and signal output.

Step Two

Classify

Each sentence is evaluated against the full classifier library using 768-dimensional semantic vectors — not keywords. Two sentences with zero shared words will fire the same classifier if they describe the same business condition.

Named signals: Demand Accelerating · Gross Margin Compression · Going Concern · Backlog Building · and hundreds more.
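As a rough illustration of vector-based matching, the sketch below scores sentence embeddings with a per-classifier logistic regression. Everything in it is an assumption made for illustration: the function names, the weights, the 0.50 firing threshold (borrowed from the lowest confidence band on this page), and the 4-dimensional toy vectors standing in for the 768-dimensional ones.

```python
import math


def sigmoid(z: float) -> float:
    """Map a raw score to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def classifier_confidence(embedding, weights, bias):
    """Logistic regression over a sentence embedding (hypothetical helper)."""
    if len(embedding) != len(weights):
        raise ValueError("embedding and weights must have the same length")
    z = sum(w * x for w, x in zip(weights, embedding)) + bias
    return sigmoid(z)


# Toy 4-dimensional stand-ins; the real vectors are described as 768-dimensional.
# These weights are invented for the example, not taken from any trained model.
BACKLOG_WEIGHTS = [2.0, 1.5, -1.0, 0.5]
BACKLOG_BIAS = -1.0
FIRE_THRESHOLD = 0.50  # assumed: lowest stored confidence band

# Two differently worded sentences with similar meaning would land close
# together in embedding space; an unrelated sentence would not.
sentence_a = [0.9, 0.8, 0.1, 0.2]
sentence_b = [0.8, 0.9, 0.0, 0.3]
unrelated = [0.0, 0.1, 0.9, 0.1]

conf_a = classifier_confidence(sentence_a, BACKLOG_WEIGHTS, BACKLOG_BIAS)
conf_b = classifier_confidence(sentence_b, BACKLOG_WEIGHTS, BACKLOG_BIAS)
conf_unrelated = classifier_confidence(unrelated, BACKLOG_WEIGHTS, BACKLOG_BIAS)

for name, conf in [("a", conf_a), ("b", conf_b), ("unrelated", conf_unrelated)]:
    print(name, round(conf, 3), "fires" if conf >= FIRE_THRESHOLD else "no match")
```

Because the score depends only on where a sentence sits in vector space, the two similar vectors fire together even though nothing in the code compares words.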

Step Three

Score

Every classifier match receives a confidence score from that classifier's logistic regression model. Filing-level signal strength is the sum of all matched sentence scores, so it captures how pervasively a theme runs through the document, not just whether it appeared once.

High 0.85–1.0 · Medium 0.60–0.84 · Low 0.50–0.59. All matches are stored; consumers aggregate by use case.
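A minimal sketch of the scoring arithmetic, using the bands and the sum-of-scores rule stated above. The function names are hypothetical, and treating each band as a half-open interval (so a score such as 0.845 counts as Medium) is an assumption, since the published ranges leave small gaps between bands.

```python
def confidence_band(score: float):
    """Return the band name for a confidence score, or None below 0.50."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    if score >= 0.85:
        return "High"
    if score >= 0.60:
        return "Medium"
    if score >= 0.50:
        return "Low"
    return None  # below the lowest stored band


def filing_signal_strength(sentence_scores, minimum=0.50):
    """Sum the confidence scores of all matched sentences in one filing."""
    return sum(s for s in sentence_scores if s >= minimum)


# Illustrative scores for four matched sentences in a single filing.
scores = [0.88, 0.62, 0.55, 0.91]
strength = filing_signal_strength(scores)
high_only = filing_signal_strength(scores, minimum=0.85)

print([confidence_band(s) for s in scores])
print(round(strength, 2), round(high_only, 2))
```

Raising `minimum` is one way a consumer could aggregate by use case, for example counting only High matches for a conservative screen.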

Step Four

Link

Every matched sentence is preserved verbatim and paired with a live URL to the original filing on SEC EDGAR. The chain from signal output back to primary source is never broken — for every classifier, every sentence, every filing.

No signal without a sentence. No sentence without a filing. Every output auditable to the primary regulatory disclosure.
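One way such a link could be assembled is from a filer's CIK and the filing's accession number. This is a sketch only: the helper name is hypothetical, the path layout follows the pattern commonly seen for EDGAR filing index pages and should be checked against SEC documentation, and the CIK in the usage example is an assumption for illustration.

```python
import re

# Accession numbers have the form NNNNNNNNNN-YY-NNNNNN.
ACCESSION_RE = re.compile(r"^\d{10}-\d{2}-\d{6}$")


def edgar_index_url(cik: int, accession: str) -> str:
    """Build a filing index URL (hypothetical helper, assumed path layout)."""
    if not ACCESSION_RE.match(accession):
        raise ValueError(f"unexpected accession format: {accession!r}")
    folder = accession.replace("-", "")  # folder name drops the dashes
    return (
        "https://www.sec.gov/Archives/edgar/data/"
        f"{int(cik)}/{folder}/{accession}-index.htm"
    )


# Accession number from the RIG example on this page. The CIK is assumed here
# from the accession prefix, which identifies the submitter and is not
# guaranteed to match the issuer's CIK.
url = edgar_index_url(1451505, "0001451505-26-000018")
print(url)
```

Storing the accession number alongside each sentence keeps the link reproducible even if the URL itself is regenerated later.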

Three outputs. Every filing.

Named Business Signals

Hundreds of named NLP classifiers, each detecting a specific business condition — not a general topic, a named signal. Demand Accelerating. Operating Margin Declining. Going Concern. Backlog Building. Every classifier has a name, a definition, and a trained model behind it.

Sentence-Level Evidence

Every matched sentence is preserved verbatim. No paraphrase. No summary. The exact regulatory language that triggered the signal — alongside the classifier name, confidence score, and its position within the filing.
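To make the shape of an evidence record concrete, here is a minimal sketch of one as a Python dataclass. The field names and validation rules are assumptions rather than the actual Provenance Core schema; the sentence and confidence come from the NVDA example on this page, while the position index and URL are placeholders.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class SignalRecord:
    """One matched sentence with its provenance (hypothetical schema)."""
    ticker: str
    classifier: str
    sentence: str        # verbatim text, never paraphrased
    confidence: float
    sentence_index: int  # position within the filing
    source_url: str

    def __post_init__(self):
        # Enforce "no signal without a sentence, no sentence without a filing".
        if not self.sentence.strip():
            raise ValueError("a record must carry its verbatim sentence")
        if not self.source_url.startswith("https://"):
            raise ValueError("a record must carry a source URL")
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be between 0 and 1")


record = SignalRecord(
    ticker="NVDA",
    classifier="Demand Accelerating",
    sentence=(
        "Revenue for the fourth quarter was a record $68.1 billion, "
        "up 73% from a year ago and up 20% sequentially."
    ),
    confidence=0.94,
    sentence_index=0,  # placeholder position
    source_url="https://www.sec.gov/PLACEHOLDER",  # placeholder, not a real link
)
record_json = json.dumps(asdict(record), indent=2)
print(record_json)
```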

Live Source Links

Every output carries a direct URL to the original SEC filing on EDGAR. Not a cached copy. Not a data vendor intermediary. The primary regulatory disclosure, accessible to any reviewer in seconds.

What it looks like in practice

Three tickers. Verbatim sentences from live filings. Every detail auditable.

RIG
Backlog Building

"Consistent with our prior expectations, tendering activity and contract awards increased during the latter part of 2025."

RIG 10-K · Filed 2026-02-23
Accession 0001451505-26-000018
Confidence: 0.88 · Echo: 4.76 · Streak: 8

Eight consecutive filings. Echo climbed from 2.61 to 4.76. Two years of unbroken order-visibility narrative.

Live link included in every output record →

NKE
Op. Margin Declining

"Gross margin for the third quarter of fiscal 2026 decreased 130 basis points to 40.2% primarily due to higher tariffs in North America."

NKE 10-Q · Filed 2026-04-01
Confidence: 0.91 · Echo: 4.96 · Streak: 19

19 consecutive filings, nearly five years. Echo at its ceiling, reinforced faster than it decays. Structural, not cyclical.

NVDA
Demand Accelerating

"Revenue for the fourth quarter was a record $68.1 billion, up 73% from a year ago and up 20% sequentially."

NVDA 8-K · Filed 2026-02-26
Confidence: 0.94 · Echo: 5.66 · Streak: 12

Streak of 12 at peak echo. Three years of demand language in every quarterly disclosure. Invisible in any single filing alone.

434M+ signal hits stored
30 years of SEC filing history
Every sentence linked to its source
10.5% high-confidence matches

Built for every financial professional who needs data they can explain

Quant Analysts

Echo and Streak become quantitative inputs — how long has a signal been building, how persistent, how rare? Every data point carries a live SEC link. When the risk team asks where it came from, the answer is one click away.
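For readers who want to see Streak as a number, the sketch below counts consecutive filings, ending with the most recent, in which a classifier fired. This reading of Streak is inferred from the examples above and is an assumption, as is the function name. Echo is not sketched because its formula is not given here.

```python
def current_streak(fired: list[bool]) -> int:
    """Count consecutive True values at the end of a filing history.

    `fired` is ordered oldest to newest; each entry says whether the
    classifier matched at least one sentence in that filing.
    """
    streak = 0
    for hit in reversed(fired):
        if not hit:
            break
        streak += 1
    return streak


# Illustrative history: one miss followed by eight consecutive hits.
history = [True, False] + [True] * 8
print(current_streak(history))
```

A single filing without a match resets the count, which is what makes a long streak informative about persistence.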

Fundamental Analysts

Instead of starting from a blank page, analysts start from what the company said in its own words — and how long it has been saying it. Provenance surfaces companies where signals are building, breaking, or converging, then shows the verbatim sentences.

Model Risk & Compliance

Every Provenance Core output carries the verbatim source sentence, confidence score, and a live link to the original SEC EDGAR filing. The audit trail is complete before reviewers ask for it. Provenance does not just pass review; it makes the review easy.

ML Engineers

Millions of labeled tuples — sentence, classifier, confidence score, live source URL. Fine-tuning on this data teaches models to cite every assertion, directly addressing hallucination at the training data layer.
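As a hedged sketch of how such tuples might feed a fine-tuning job, the snippet below turns each (sentence, classifier, confidence, source URL) tuple into a prompt and completion pair serialized as JSON Lines. The prompt wording, field names, and JSONL layout are assumptions, not a format documented by Provenance; the URL is a placeholder.

```python
import json


def to_training_example(sentence, classifier, confidence, source_url):
    """Format one labeled tuple so the target text always cites its source."""
    return {
        "prompt": (
            "Classify the business condition in this filing sentence "
            f"and cite the source.\n\nSentence: {sentence}"
        ),
        "completion": (
            f"{classifier} (confidence {confidence:.2f}). Source: {source_url}"
        ),
    }


def to_jsonl(rows):
    """Serialize an iterable of tuples as one JSON object per line."""
    return "\n".join(json.dumps(to_training_example(*row)) for row in rows)


rows = [
    (
        "Consistent with our prior expectations, tendering activity and "
        "contract awards increased during the latter part of 2025.",
        "Backlog Building",
        0.88,
        "https://www.sec.gov/PLACEHOLDER",  # placeholder, not a real link
    ),
]
jsonl = to_jsonl(rows)
print(jsonl)
```

Putting the source URL in the completion rather than the prompt is the design choice that would push a model toward emitting a citation with every label.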

Provenance Core is the foundation.

Everything else in the Provenance platform is built on top of what Provenance Core produces. Provenance Stream extends classifier signals across six independent data channels. Provenance Reports packages classifier outputs into formatted analyst reports. Provenance Connect exposes the full platform via a native MCP server.

Provenance Core (Active) · Provenance Stream · Provenance Reports (Coming Soon) · Provenance Connect

Every signal. Every source. Every time.

Sample datasets for NVDA, NKE, and RIG are available immediately in JSON and CSV formats.