How It Works — SEC Filing to Alpha Signal

01

Ingest

SEC EDGAR Monitoring

ARGOS continuously polls the SEC EDGAR full-text search and RSS feeds. New filings are detected within minutes of publication and queued for processing. The pipeline has continuous coverage going back to 2014 — every 10-K, 10-Q, and 8-K for 4,800+ liquid tickers.
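The detect-and-queue step can be pictured with a small sketch. It is illustrative only: `FilingNotice`, `DedupQueue`, and the sample accession numbers are hypothetical names and values, not part of ARGOS, and the feed polling itself is left out. The sketch assumes deduplication is keyed on the accession number.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class FilingNotice:
    """One feed entry (hypothetical shape, not the ARGOS schema)."""
    accession: str   # made-up example: "0000000000-24-000001"
    form_type: str   # "10-K", "10-Q", "8-K", "DEF 14A", ...
    ticker: str


class DedupQueue:
    """FIFO queue that accepts each accession number at most once."""

    def __init__(self, wanted_forms):
        self._wanted = set(wanted_forms)
        self._seen = set()
        self._queue = deque()

    def offer(self, notice):
        """Enqueue a notice; return False if it is filtered out or a repeat."""
        if notice.form_type not in self._wanted:
            return False
        if notice.accession in self._seen:
            return False
        self._seen.add(notice.accession)
        self._queue.append(notice)
        return True

    def take(self):
        """Pop the oldest pending notice, or None when nothing is queued."""
        return self._queue.popleft() if self._queue else None

    def __len__(self):
        return len(self._queue)
```

Keying on the accession number means a filing that shows up in both the RSS feed and a search result is processed once.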

Filing Types

10-K · 10-Q · 8-K · DEF 14A

Detection Latency

Near Real-time

Universe Coverage

4,800+ liquid US tickers

Historical Coverage

2014 – present (10+ years)

EDGAR RSS · full-text search API · deduplication · queue-based pipeline · 3.7M+ filings processed

02

Parse & Filter

Sentence-Level Decomposition

Raw HTML/XBRL is stripped and the filing text is segmented into individual sentences. Legal boilerplate — forward-looking statement disclaimers, risk factor repetitions, exhibit lists — is identified and removed using Hyperscan pattern matching before any classifier ever sees the text. This removes 60%+ of low-signal content and concentrates classifier power on substantive disclosure.
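A minimal sketch of the strip, segment, and filter sequence follows. It substitutes Python's standard `re` module for Hyperscan. The patterns, the naive sentence splitter, and all function names are assumptions made for illustration; the production pattern set is not shown on this page.

```python
import re

# Hypothetical boilerplate patterns, for illustration only.
BOILERPLATE_PATTERNS = [
    re.compile(r"forward-looking statements?", re.IGNORECASE),
    re.compile(r"private securities litigation reform act", re.IGNORECASE),
    re.compile(r"^exhibit\s+\d+", re.IGNORECASE),
    re.compile(r"incorporated (herein )?by reference", re.IGNORECASE),
]

TAG_RE = re.compile(r"<[^>]+>")
# Naive splitter: break after ., ! or ? when followed by whitespace and a capital.
SENTENCE_RE = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")


def strip_markup(raw_html):
    """Replace tags with spaces, then collapse runs of whitespace."""
    text = TAG_RE.sub(" ", raw_html)
    return re.sub(r"\s+", " ", text).strip()


def split_sentences(text):
    """Split plain text into sentences with the naive rule above."""
    return [s.strip() for s in SENTENCE_RE.split(text) if s.strip()]


def is_boilerplate(sentence):
    """True if any boilerplate pattern matches the sentence."""
    return any(p.search(sentence) for p in BOILERPLATE_PATTERNS)


def clean_sentences(raw_html):
    """Return the sentences that survive the boilerplate filter."""
    return [s for s in split_sentences(strip_markup(raw_html))
            if not is_boilerplate(s)]
```

A real filing needs a proper HTML/XBRL parser and a sturdier sentence splitter (abbreviations and decimals defeat this one). The sketch is meant only to show the order of operations: filter first, classify later.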

Input

Raw SEC EDGAR HTML / XBRL document

Noise Removed

60%+ boilerplate filtered before classification

Segmentation

sentence → embedding unit

Output

Clean sentence corpus per filing

Hyperscan pattern matching · HTML stripping · XBRL parsing · sentence segmentation · boilerplate removal

03

Embed, Classify & Validate

472 Classifiers. Three Quality Gates.

Each sentence is embedded into 768-dimensional space (E5-base-v2) and scored by every classifier in parallel. 159 are human-curated from investment theses. 313 were discovered by unsupervised clustering on 151M sentences — finding signals humans would never think to look for. But scoring is only half the story. AUC alone doesn’t catch classifiers that fire on the wrong things.

# A classifier with 0.98 AUC scored this sentence with high confidence:
"hired 500 employees to support expansion"
# The classifier was trained to detect layoff announcements.
# AUC said it was excellent. It wasn't.
      

AUC measures how well a classifier separates its training data. It says nothing about what happens on 151 million real sentences. We built three independent quality gates — each catches a different failure mode.

Gate 1: Separability Test

Cross-validated precision on held-out data. Can the classifier reliably separate its own training examples? If not, it hasn’t learned the concept.

THP@20 ≥ 75% · d′ ≥ 2.5 · Cluster ratio ≥ 0.5
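Of the three thresholds, d′ is the standard sensitivity index from signal detection theory, so it can be sketched without guessing at proprietary definitions. The version below assumes the common form that divides the difference in mean scores by the root-mean of the two class variances, computed on held-out scores. The page does not say which variant is used, and THP@20 and cluster ratio are not defined here, so they are left out.

```python
import math
from statistics import mean, variance


def d_prime(pos_scores, neg_scores):
    """Sensitivity index: separation of class means in pooled-SD units."""
    pooled_sd = math.sqrt((variance(pos_scores) + variance(neg_scores)) / 2)
    if pooled_sd == 0:
        raise ValueError("scores have zero variance; d-prime is undefined")
    return (mean(pos_scores) - mean(neg_scores)) / pooled_sd


def passes_d_prime(pos_scores, neg_scores, threshold=2.5):
    """Apply the d-prime threshold quoted for Gate 1."""
    return d_prime(pos_scores, neg_scores) >= threshold
```

For example, held-out positives scoring `[2.0, 3.0, 4.0]` against negatives scoring `[-1.0, 0.0, 1.0]` give a d′ of 3.0, which clears the 2.5 bar.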

Gate 2: Fire Rate

Score 1M real sentences. If a classifier fires on more than a few percent of them, it’s detecting a topic, not an event. Specificity drives quality: rare signals are tighter signals.

< 5% for human-designed · < 1% for cluster-discovered
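The fire-rate check reduces to a single ratio. In this sketch the limits come from the line above, while the 0.5 decision threshold and the function names are assumptions; the page does not state the score cutoff that counts as "firing".

```python
# Limits quoted above: under 5% for human-designed, under 1% for cluster-discovered.
FIRE_RATE_LIMITS = {"human": 0.05, "cluster": 0.01}


def fire_rate(scores, threshold=0.5):
    """Fraction of sentences whose score reaches the (assumed) firing threshold."""
    scores = list(scores)
    if not scores:
        raise ValueError("need at least one score")
    fired = sum(1 for s in scores if s >= threshold)
    return fired / len(scores)


def passes_fire_rate(scores, source, threshold=0.5):
    """source is 'human' or 'cluster'; the rate must be strictly below the limit."""
    return fire_rate(scores, threshold) < FIRE_RATE_LIMITS[source]
```

A classifier that fires on 3 of 100 sampled sentences would pass as a human-designed classifier and fail as a cluster-discovered one.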

Gate 3: Coherence Audit

Sample 1,000 high-confidence positives and measure whether they cluster in embedding space. A classifier can pass AUC and fire rate but still fire on unrelated sentences that share surface vocabulary. Coherence catches this.

Silhouette ≥ 0.20 · Positives must form a tight concept
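A silhouette score needs something to compare against, and the page does not say what the positives are compared with. The sketch below assumes the sampled positives are treated as one cluster and a random background sample as the other, with Euclidean distance. Both choices, and the function name, are assumptions rather than the documented method.

```python
import math


def mean_silhouette(positives, background):
    """Mean silhouette of `positives` (one cluster) against `background`.

    For each positive p:
      a = mean distance to the other positives (cohesion)
      b = mean distance to the background sample (separation)
      s = (b - a) / max(a, b)
    """
    if len(positives) < 2 or not background:
        raise ValueError("need at least two positives and one background point")
    total = 0.0
    for i, p in enumerate(positives):
        a = sum(math.dist(p, q) for j, q in enumerate(positives) if j != i)
        a /= len(positives) - 1
        b = sum(math.dist(p, q) for q in background) / len(background)
        denom = max(a, b)
        total += 0.0 if denom == 0 else (b - a) / denom
    return total / len(positives)


def passes_coherence(positives, background, threshold=0.20):
    """Apply the silhouette threshold quoted for Gate 3."""
    return mean_silhouette(positives, background) >= threshold
```

This pure-Python loop is quadratic in the number of positives, which is tolerable for a 1,000-sentence sample. A vectorized implementation would be the natural choice at larger scale.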

Built

977 classifiers

Passed separability

700

Passed fire rate

586

Passed coherence

529 active

After boilerplate filter

472 in production

Removing 62 incoherent classifiers (passed AUC, failed coherence) improved downstream model IC by +48%. They weren’t just useless — they were actively injecting noise. Removing bad signal beats adding good signal.

Browse the full classifier catalog →

Model Architecture

logistic regression (calibrated)

Classifier Sources

159 human-curated + 313 cluster-discovered

Embedding Model

E5-base-v2 (768-dim)

Quality Gates

Separability + Fire Rate + Coherence

sentence embeddings · logistic regression · Platt scaling · coherence audit · cluster discovery · three-gate validation
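The scoring step in this section, a calibrated logistic regression applied to a sentence embedding, can be sketched as follows. The `Classifier` container, its field names, and the toy 3-dimensional vectors are hypothetical; real embeddings would be the 768-dimensional E5-base-v2 vectors. Platt scaling is written here as sigmoid(a · margin + b), which is one common convention and an assumption about the form used.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Classifier:
    """Frozen linear classifier with Platt calibration (hypothetical shape)."""
    name: str
    weights: tuple
    bias: float
    platt_a: float = 1.0
    platt_b: float = 0.0


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def score(clf, embedding):
    """Calibrated probability that the sentence matches the classifier's concept."""
    if len(embedding) != len(clf.weights):
        raise ValueError("embedding and weight dimensions differ")
    margin = sum(w * x for w, x in zip(clf.weights, embedding)) + clf.bias
    return sigmoid(clf.platt_a * margin + clf.platt_b)


def score_all(classifiers, embedding):
    """Score one sentence against every classifier; returns {name: probability}."""
    return {clf.name: score(clf, embedding) for clf in classifiers}
```

The loop in `score_all` is sequential for readability. With all weight vectors stacked into one matrix, the same computation becomes a single matrix-vector product, which is one way to score every classifier at once.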

04

Aggregate & Deliver

Frozen Filing Vectors

Sentence-level scores are aggregated to the filing level — summed or max-pooled depending on the signal type. The result is a flat vector of classifier scores per accession number, ready to join to your pricing or fundamental data. Classifiers are frozen and immutable: the scores you backtest today will be identical to the scores you receive in production next year.
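As a rough illustration of the aggregation, the sketch below rolls sentence-level scores up to one vector per accession number, using sum or max per classifier. The input tuple layout, the `pooling` mapping, and the classifier names in the usage note are assumptions for the example, not the delivered schema.

```python
from collections import defaultdict


def aggregate(sentence_scores, pooling):
    """Roll sentence scores up to filing level.

    sentence_scores: iterable of (accession, classifier_name, score)
    pooling: {classifier_name: "sum" | "max"}
    Returns {accession: {classifier_name: aggregated_score}}.
    """
    filings = defaultdict(dict)
    for accession, name, value in sentence_scores:
        mode = pooling[name]
        if mode not in ("sum", "max"):
            raise ValueError(f"unknown pooling mode: {mode!r}")
        row = filings[accession]
        if name not in row:
            row[name] = value
        elif mode == "sum":
            row[name] += value
        else:
            row[name] = max(row[name], value)
    return dict(filings)
```

With `pooling = {"layoffs": "max", "guidance": "sum"}`, two "layoffs" sentences scoring 0.2 and 0.9 in the same filing yield 0.9, and two "guidance" sentences scoring 0.25 and 0.5 yield 0.75. Each resulting row is keyed by accession number, so it can be joined to pricing or fundamental data on that key.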

🦆

DuckDB

Single file, query in-process. Zero infrastructure.

📄

Parquet

Columnar, compressed. Drop into Spark, Pandas, or Polars.

☁️

S3 Bucket

Daily incremental drops or full historical dataset.

🔌

REST API

Query by ticker, date range, or accession number.

🔒

SFTP

Scheduled drops to your existing data pipeline.

🤖

MCP Server

Direct AI agent access. Query signals in natural language. (Coming soon.)

sum aggregation · max pooling · per-accession vectors · frozen classifiers · daily incremental · full backfill available

From Filing to Alpha Signal
in Four Steps

Your backtest today is your production model tomorrow.

See the signals in action
