Provenance Stream

Six layers of structured corporate intelligence

Every data layer Provenance Stream delivers — what it is, how much of it there is, and what questions it answers.

Request Access →
87M Filing sentences
127M News sentences
457K Financial quarters
7,900 Companies tracked
529 Signal classifiers
9,400 Tickers in registry
01
SEC Filings · Sentence-Level
Filing Content
The Language Layer
Volume: ~87M classified sentences
Source filings: ~270K U.S. SEC filings
Forms: 10-K, 10-Q, 8-K + related
Coverage: 2019–2026 (primary: 2024+)
Embeddings: 768-dim, E5-base-v2
Classifiers per sentence: up to all 529

Every sentence in roughly 270,000 U.S. SEC filings (10-Ks, 10-Qs, 8-Ks, and related forms) is embedded as a 768-dimensional vector, so you can search by meaning rather than by keyword. Each sentence is also tagged with one or more of the 529 themed classifiers.

Two differently worded sentences, such as "orders have surged beyond our production capacity" and "we cannot fulfill backlog fast enough to meet customer commitments", land in the same classifier region even though they share no keywords. That is the layer's defining capability.
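To make "search by meaning" concrete, here is a minimal sketch of ranking sentences by cosine similarity between embedding vectors. It is an illustration only, not Provenance's implementation: the 3-dimensional vectors are invented stand-ins for real 768-dimensional embeddings, and the function names `cosine_similarity` and `rank_by_similarity` are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_by_similarity(query_vec, corpus):
    """Return (sentence, score) pairs sorted from most to least similar."""
    scored = [(text, cosine_similarity(query_vec, vec)) for text, vec in corpus]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy vectors, invented for illustration; real embeddings would come from a model.
corpus = [
    ("orders have surged beyond our production capacity", [0.9, 0.1, 0.2]),
    ("we cannot fulfill backlog fast enough to meet customer commitments", [0.8, 0.2, 0.3]),
    ("the board declared a quarterly dividend", [0.1, 0.9, 0.1]),
]
query_vec = [0.85, 0.15, 0.25]  # hypothetical embedding of a "demand exceeds supply" query

ranked = rank_by_similarity(query_vec, corpus)
for text, score in ranked:
    print(f"{score:.3f}  {text}")
```

With these toy values, the two capacity-related sentences score close to each other and well above the dividend sentence, even though they share no keywords.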

Example questions it answers

Find sentences across all 2025 10-Ks that discuss AI capex commitments.
What is Pfizer saying about pipeline setbacks in the latest 10-K?
Show me language across semiconductor companies about export controls in 2024-2025.
02
Quarterly XBRL
Financial Statements
The Numbers Layer
Volume: ~457K quarterly observations
Companies: ~4,400 public companies
History: back to ~2009
Payload columns: 50 per quarter
Source: SEC XBRL filings
50-column payload — revenue, COGS, gross profit, operating income, interest expense, net income, EPS, shares outstanding, operating cash flow, free cash flow, capex, dividends paid, cash, inventory, total debt, operating lease assets/liabilities, stockholders' equity, YoY growth rates, and margin metrics.

Structured quarterly financial line items extracted from the XBRL data tagged inside every 10-K and 10-Q. The layer covers the income statement, balance sheet, and cash flow statement, plus a set of derived ratios and growth rates, all computed from primary SEC filings rather than third-party estimates.

Coverage was recently expanded from 37 to 50 columns per quarter. Every observation is linked back to the accession number of the filing it came from.

Example questions it answers

What was Apple's gross margin trend over the last 8 quarters?
Compute Palantir's 2-year revenue CAGR.
Show me Oracle's effective tax rate for FY2024.
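
Questions like the gross-margin trend and the revenue CAGR reduce to simple arithmetic once the quarterly line items are available. The sketch below shows that arithmetic on made-up numbers; the dictionary keys `revenue` and `cogs` and the helper names are assumptions for illustration, not the actual column names or API.

```python
def gross_margin(revenue, cogs):
    """Gross margin as a fraction of revenue."""
    return (revenue - cogs) / revenue

def cagr(start_value, end_value, years):
    """Compound annual growth rate between two values `years` apart."""
    if start_value <= 0 or years <= 0:
        raise ValueError("start_value and years must be positive")
    return (end_value / start_value) ** (1 / years) - 1

# Hypothetical quarterly observations (illustrative values only).
quarters = [
    {"period": "Q1", "revenue": 100.0, "cogs": 60.0},
    {"period": "Q2", "revenue": 110.0, "cogs": 64.0},
    {"period": "Q3", "revenue": 120.0, "cogs": 66.0},
]
margins = [(q["period"], gross_margin(q["revenue"], q["cogs"])) for q in quarters]

# Two-year CAGR from hypothetical annual revenue of 100 growing to 144.
two_year_cagr = cagr(100.0, 144.0, 2)

for period, margin in margins:
    print(f"{period}: {margin:.1%}")
print(f"2-year revenue CAGR: {two_year_cagr:.1%}")
```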
03
Aggregated · Continuous
Company Signals
The Cooked Intelligence Layer
Volume: ~7,900 company snapshots
Refresh: continuously updated
Payload: 50+ fields per company
Source: internal cascade pipeline

A continuously refreshed snapshot per company that aggregates everything Provenance knows about that issuer: filing-language signals (echo, distress, mutation, entropy), credit metrics (spread, widening signals), price and volatility (ATR, beta, volume, moving-average cross), insider trading activity, 8-K event density, crisis intensity, and category tags.

Signal dimensions per company

Echo signals: echo_composite, distress_echo, positive_echo, echo_divergence, profile_mutation, entropy_surprise
Market data: price_close, return_5d/21d, ATR, beta_60d, volume_multiple, MA cross 50/200
Credit: credit_spread_zscore, percentile, widening/tightening flags, echo_widening
Insider: buying days 30d/90d, conviction_echo, active_signals array

Example questions it answers

Which industrials have rising distress_echo with restructuring language firing?
Give me the full signal profile on AAPL — echo trends, distress, credit spread, insider activity.
Find SPAC-like companies with anomalous filing entropy.
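
A screen over company snapshots can be pictured as a filter on a few of these fields. The sketch below is a hedged illustration: `distress_echo` and `credit_spread_zscore` are field names from the list above, but the `ticker` and `sector` keys, the value scales, the thresholds, and the `screen` helper are all assumptions, and the tickers are placeholders.

```python
# Hypothetical snapshots; values and scales are invented for illustration.
snapshots = [
    {"ticker": "AAA", "sector": "Industrials", "distress_echo": 0.72, "credit_spread_zscore": 1.8},
    {"ticker": "BBB", "sector": "Industrials", "distress_echo": 0.15, "credit_spread_zscore": -0.3},
    {"ticker": "CCC", "sector": "Technology", "distress_echo": 0.80, "credit_spread_zscore": 2.1},
]

def screen(rows, sector, min_distress, min_spread_z):
    """Return tickers in `sector` whose distress and credit signals both exceed the thresholds."""
    return [
        row["ticker"]
        for row in rows
        if row["sector"] == sector
        and row["distress_echo"] >= min_distress
        and row["credit_spread_zscore"] >= min_spread_z
    ]

flagged = screen(snapshots, "Industrials", min_distress=0.5, min_spread_z=1.0)
print(flagged)
```

A "rising" signal would additionally require comparing snapshots over time, which this single-snapshot sketch does not attempt.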
04
News · Press Releases
News & PR
The Public Narrative Layer
Sentences: ~127M sentences
Articles: ~4.2M articles
Coverage: 2020–2026 (MCP: 2023+)
Sources: GlobeNewswire, PRNewswire
Layers: sentence + article-level
Two parallel indexes: sentence-level for semantic search on individual sentences; article-level with title + centroid embeddings, ticker tags, channel labels, author, and headline.

Every sentence in every news article and press release from major business newswires is vector-embedded so you can search by meaning. The article-level layer carries title, author, publication time, ticker tags, and source channel labels (Press Releases, Health Care, Commodities, Financial Services, etc.).

The key capability is cross-surface triangulation — pairing filing signals with news from the same time window to connect what management is telling regulators with what's being announced publicly. No other connector joins these two surfaces.

Example questions it answers

What FDA approval announcements were made by biotech companies in Q1 2026?
Show me press releases about workforce reductions in retail this quarter.
PFE's distress signal is rising — find news from the same window that might explain it.
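
Pairing a filing signal with news "from the same window" amounts to a date-range join on ticker. The following is a minimal sketch under stated assumptions: the record layout (`headline`, `tickers`, `published`), the 7-day default window, and the `news_in_window` helper are hypothetical, and the headlines and dates are invented placeholders.

```python
from datetime import date, timedelta

def news_in_window(articles, ticker, signal_date, days=7):
    """Return articles tagged with `ticker` published within +/- `days` of `signal_date`."""
    start = signal_date - timedelta(days=days)
    end = signal_date + timedelta(days=days)
    matches = [
        a for a in articles
        if ticker in a["tickers"] and start <= a["published"] <= end
    ]
    return sorted(matches, key=lambda a: a["published"])

# Invented example records, for illustration only.
articles = [
    {"headline": "Example trial update", "tickers": ["PFE"], "published": date(2025, 3, 3)},
    {"headline": "Example earnings release", "tickers": ["PFE"], "published": date(2025, 1, 28)},
    {"headline": "Unrelated announcement", "tickers": ["XYZ"], "published": date(2025, 3, 4)},
]
signal_date = date(2025, 3, 5)  # hypothetical date on which the signal moved

nearby = news_in_window(articles, "PFE", signal_date)
for article in nearby:
    print(article["published"], article["headline"])
```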
05
Exhibit 10.x · CUAD
Material Contracts
The Legal Clause Layer
Contract sentences: ~3.9M vectors
Distinct agreements: ~39K
Extracted clauses: ~250K (CUAD-classified)
Coverage: 2025–2026
Clause taxonomy: ~40 CUAD clause types
Loaded, not yet exposed. The data is in the system — clause search and agreement-level tools are on the roadmap.

Material agreements filed as Exhibit 10.x in SEC filings — credit facilities, employment agreements, merger agreements, license agreements, and more. Sentences are vector-embedded for semantic search; agreements are summarized at the contract level; and individual clauses are extracted and classified using the CUAD taxonomy.

CUAD (Contract Understanding Atticus Dataset) is a widely used legal-AI benchmark covering ~40 clause types. Each extracted clause is annotated with risk level, favorability, and whether the clause survives termination.

What it will answer when live

Find change-of-control clauses that fire on a specific transaction type.
Compare termination-for-convenience provisions across recent credit facility amendments.
Show me anti-assignment clauses in employment agreements that survive termination.
06
Reference · Registry
Ticker Registry
The Identity Layer
Tickers: ~9,400 active US-listed
Fields: CIK, exchange, SIC, fiscal year-end, entity type
Freshness: latest_filing_date per ticker
Refresh: continuously updated
Source: EDGAR + exchange listings

The canonical Provenance ticker registry — every active US-listed equity we track, with its CIK number, exchange (NYSE, Nasdaq, etc.), SIC code and description, fiscal-year-end date, entity type, and a latest_filing_date field for freshness checking.

This is the shared identity layer that underpins every other lookup. When you query by ticker, the registry resolves it to a CIK, which is the stable identifier across all other data layers.

Built by joining SEC EDGAR company facts with exchange listings. Used implicitly by every ticker-based operation — not exposed as a standalone tool, but essential to all of them.
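
The ticker-to-CIK resolution described above can be sketched as a dictionary lookup. This is an illustration, not the registry's actual interface: the `REGISTRY` structure and `resolve_cik` helper are hypothetical. The 10-digit zero-padded form follows the convention SEC EDGAR uses for CIKs in its data APIs.

```python
# Tiny hypothetical registry; the real one covers ~9,400 tickers with more fields.
REGISTRY = {
    "AAPL": {"cik": 320193, "exchange": "Nasdaq"},
    "MSFT": {"cik": 789019, "exchange": "Nasdaq"},
}

def resolve_cik(ticker):
    """Resolve a ticker to its CIK as a 10-digit zero-padded string."""
    entry = REGISTRY.get(ticker.strip().upper())
    if entry is None:
        raise KeyError(f"unknown ticker: {ticker!r}")
    return f"{entry['cik']:010d}"

print(resolve_cik("aapl"))
```

Normalizing case and whitespace before the lookup keeps every downstream layer keyed on the same stable identifier, regardless of how the ticker was typed.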

529 Named Signal Classifiers

A central reference layer: the names, human-readable labels, descriptions, source (research-derived vs. LLM-derived), information coefficient, and signal weight for every classifier. This is the vocabulary that turns raw filing language into themed signal.

529 Total active classifiers
159 Human-curated classifiers
313 Machine-discovered clusters
0.85 Production signal threshold
Each classifier carries a direction (positive / negative / neutral), an information coefficient against forward returns, and an LLM weight reflecting alpha relevance.

Browse the full classifier catalog →

All data layers — totals & coverage

| Data Layer | Volume | Coverage | Status |
|---|---|---|---|
| SEC Filing Sentences | ~87M sentences | 2019–2026 | Live |
| News & PR Sentences | ~127M sentences | 2020–2026 | Partial (2023+) |
| News Article Metadata | ~4.2M articles | 2020–2026 | Live |
| Material Contract Sentences | ~3.9M sentences | 2025–2026 | Coming Soon |
| Contract Clauses (CUAD) | ~250K clauses | 2025–2026 | Coming Soon |
| Quarterly XBRL Financials | ~457K observations | ~2009–2026 | Live |
| Company Signal Snapshots | ~7,900 companies | Current (live) | Live |
| Ticker Registry | ~9,400 tickers | Current (live) | Live |
| Filing Theme Classifiers | 529 classifiers | Static lookup | Live |

Connect to all six layers

Provenance is by invitation. Request access and we'll add you to the allowlist.
