Six layers of structured corporate intelligence
Every data layer Provenance Stream delivers — what it is, how much of it there is, and what questions it answers.
Every sentence in roughly 270,000 U.S. SEC filings (10-Ks, 10-Qs, 8-Ks, and related forms) is converted into a 768-dimensional vector embedding, so you can search by meaning rather than keyword. Each sentence is also tagged with one or more of 529 themed classifiers.
Two sentences with no keywords in common, such as "orders have surged beyond our production capacity" and "we cannot fulfill backlog fast enough to meet customer commitments", land in the same classifier region because they describe the same situation. That's the defining capability.
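To make "search by meaning" concrete, here is a minimal sketch of how two embeddings are typically compared, using cosine similarity. The 4-dimensional vectors are made-up stand-ins for the real 768-dimensional embeddings, and the choice of cosine similarity is an assumption for illustration, not a statement of how Provenance scores matches.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Hypothetical toy embeddings (real ones would have 768 dimensions).
surge = [0.9, 0.1, 0.8, 0.0]     # "orders have surged beyond our production capacity"
backlog = [0.8, 0.2, 0.9, 0.1]   # "we cannot fulfill backlog fast enough ..."
dividend = [0.0, 0.9, 0.1, 0.8]  # an unrelated sentence, e.g. about dividends

related = cosine_similarity(surge, backlog)
unrelated = cosine_similarity(surge, dividend)
print(f"related={related:.3f} unrelated={unrelated:.3f}")
```

In this toy setup the two capacity-constraint sentences score far closer to each other than either does to the unrelated one, even though they share no keywords.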
Example questions it answers
Structured quarterly financial line items extracted from the XBRL data tagged inside every 10-K and 10-Q. Includes income statement, balance sheet, cash flow statement, and a set of derived ratios and growth rates, all computed from primary SEC filings rather than third-party estimates.
Coverage was recently expanded from 37 to 50 columns per quarter. Every observation is linked back to the accession number of the filing it came from.
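As a rough illustration of a derived metric that stays traceable to its source filing, the sketch below computes a year-over-year growth rate and a net margin from two quarterly observations. The `Observation` class, its field names, and the placeholder accession numbers are hypothetical and do not reflect the actual 50-column schema.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Observation:
    """Hypothetical quarterly record; the real schema has more columns."""
    ticker: str
    fiscal_period: str
    revenue: float
    net_income: float
    accession_number: str  # filing the figures were extracted from

def yoy_growth(current: float, prior: float) -> Optional[float]:
    """Year-over-year growth; None when the prior value is zero."""
    if prior == 0:
        return None
    return (current - prior) / abs(prior)

def net_margin(obs: Observation) -> Optional[float]:
    """Net income divided by revenue; None when revenue is zero."""
    if obs.revenue == 0:
        return None
    return obs.net_income / obs.revenue

# Placeholder values, not real filings.
q1_prior = Observation("AAA", "2023Q1", 100.0, 10.0, "0000000000-23-000001")
q1_now = Observation("AAA", "2024Q1", 120.0, 18.0, "0000000000-24-000001")

growth = yoy_growth(q1_now.revenue, q1_prior.revenue)
print(f"revenue growth {growth:.1%}, net margin {net_margin(q1_now):.1%}, "
      f"source filing {q1_now.accession_number}")
```

Carrying the accession number on each record means any derived figure can be checked against the filing it came from.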
Example questions it answers
A continuously-refreshed snapshot per company that aggregates everything Provenance knows about that issuer — filing-language signals (echo, distress, mutation, entropy), credit metrics (spread, widening signals), price and volatility (ATR, beta, volume, moving-average cross), insider trading activity, 8-K event density, crisis intensity, and category tags.
Signal dimensions per company
Filing language: echo_composite, distress_echo, positive_echo, echo_divergence, profile_mutation, entropy_surprise
Price and volatility: price_close, return_5d/21d, ATR, beta_60d, volume_multiple, MA cross 50/200
Credit: credit_spread_zscore, percentile, widening/tightening flags, echo_widening
Insider activity and active signals: buying days 30d/90d, conviction_echo, active_signals array
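A snapshot with these dimensions lends itself to simple screens. The sketch below filters for companies that show both elevated distress language and a wide credit spread. It borrows the `distress_echo` and `credit_spread_zscore` names from the list above, but the dictionary layout, the thresholds, and the sample values are assumptions made for illustration.

```python
def screen_snapshots(snapshots, min_distress_echo=0.7, min_spread_zscore=2.0):
    """Return tickers whose snapshot clears both (hypothetical) thresholds."""
    hits = []
    for snap in snapshots:
        distress = snap.get("distress_echo")
        zscore = snap.get("credit_spread_zscore")
        # Skip companies where either signal is missing.
        if distress is None or zscore is None:
            continue
        if distress >= min_distress_echo and zscore >= min_spread_zscore:
            hits.append(snap["ticker"])
    return hits

# Made-up snapshots with placeholder tickers.
snapshots = [
    {"ticker": "AAA", "distress_echo": 0.82, "credit_spread_zscore": 2.4},
    {"ticker": "BBB", "distress_echo": 0.91, "credit_spread_zscore": 0.3},
    {"ticker": "CCC", "distress_echo": 0.10, "credit_spread_zscore": None},
]

print(screen_snapshots(snapshots))
```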
Example questions it answers
Every sentence in every news article and press release from major business newswires, vector-embedded so you can search by meaning. The article-level layer carries title, author, publication time, ticker tags, and source channel labels (Press Releases, Health Care, Commodities, Financial Services, etc.).
The key capability is cross-surface triangulation — pairing filing signals with news from the same time window to connect what management is telling regulators with what's being announced publicly. No other connector joins these two surfaces.
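One way to picture triangulation is as a join on ticker plus a date window. The following is a minimal sketch under that assumption: the record shapes, the `pair_within_window` helper, and the 7-day default are hypothetical and only illustrate the idea of matching filing signals to nearby news.

```python
from datetime import date, timedelta

def pair_within_window(filing_signals, news_items, window_days=7):
    """Pair each filing signal with same-ticker news published within the window."""
    window = timedelta(days=window_days)
    pairs = []
    for sig in filing_signals:
        for item in news_items:
            same_ticker = item["ticker"] == sig["ticker"]
            close_in_time = abs(item["published"] - sig["filed"]) <= window
            if same_ticker and close_in_time:
                pairs.append((sig, item))
    return pairs

# Made-up records with placeholder tickers.
filing_signals = [
    {"ticker": "AAA", "filed": date(2024, 5, 1), "signal": "distress_echo"},
]
news_items = [
    {"ticker": "AAA", "published": date(2024, 5, 4), "title": "AAA announces plant expansion"},
    {"ticker": "AAA", "published": date(2024, 6, 20), "title": "AAA names new CFO"},
    {"ticker": "BBB", "published": date(2024, 5, 2), "title": "BBB wins supply contract"},
]

for sig, item in pair_within_window(filing_signals, news_items):
    print(sig["ticker"], sig["signal"], "<->", item["title"])
```

The nested loop is fine for a sketch; at the volumes quoted on this page, an index on ticker and date would be the natural replacement.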
Example questions it answers
Material agreements filed as Exhibit 10.x in SEC filings — credit facilities, employment agreements, merger agreements, license agreements, and more. Sentences are vector-embedded for semantic search; agreements are summarized at the contract level; and individual clauses are extracted and classified using the CUAD taxonomy.
CUAD (Contract Understanding Atticus Dataset) is a standard legal-AI benchmark covering ~40 clause types. Each extracted clause is annotated with risk level, favorability, and whether the clause survives termination.
What it will answer when live
The canonical Provenance ticker registry — every active US-listed equity we track, with its CIK number, exchange (NYSE, Nasdaq, etc.), SIC code and description, fiscal-year-end date, entity type, and a latest_filing_date field for freshness checking.
This is the shared identity layer that underpins every other lookup. When you query by ticker, the registry resolves it to a CIK, which is the stable identifier across all other data layers.
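The resolution step can be sketched as a dictionary lookup that normalizes the ticker and returns the CIK in the zero-padded 10-digit form SEC EDGAR uses. The `REGISTRY` contents and the `resolve_cik` helper are illustrative assumptions, not the actual Provenance interface.

```python
# Illustrative entries only; the real registry holds ~9,400 tickers with more fields.
REGISTRY = {
    "AAPL": {"cik": 320193, "exchange": "Nasdaq"},
    "MSFT": {"cik": 789019, "exchange": "Nasdaq"},
}

def resolve_cik(ticker, registry=REGISTRY):
    """Resolve a ticker to a zero-padded 10-digit CIK string."""
    key = ticker.strip().upper()
    entry = registry.get(key)
    if entry is None:
        raise KeyError(f"ticker not in registry: {key}")
    return f"{entry['cik']:010d}"

print(resolve_cik(" aapl "))
```

Downstream lookups would then key on the CIK rather than the ticker, since tickers can change while the CIK stays fixed.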
529 Named Signal Classifiers
A central reference layer: the names, human-readable labels, descriptions, source (research-derived vs. LLM-derived), information coefficient, and signal weight for every classifier. This is the vocabulary that turns raw filing language into themed signal.
All data layers — totals & coverage
| Data Layer | Volume | Coverage | Status |
|---|---|---|---|
| SEC Filing Sentences | ~87M sentences | 2019–2026 | Live |
| News & PR Sentences | ~127M sentences | 2020–2026 | Partial (2023+) |
| News Article Metadata | ~4.2M articles | 2020–2026 | Live |
| Material Contract Sentences | ~3.9M sentences | 2025–2026 | Coming Soon |
| Contract Clauses (CUAD) | ~250K clauses | 2025–2026 | Coming Soon |
| Quarterly XBRL Financials | ~457K observations | ~2009–2026 | Live |
| Company Signal Snapshots | ~7,900 companies | Current (live) | Live |
| Ticker Registry | ~9,400 tickers | Current (live) | Live |
| Filing Theme Classifiers | 529 classifiers | Static lookup | Live |