Technical writing
Building the Cross-Agency Regulatory Entity Graph: 50M+ Records, One Join
The Federal Regulatory Data Hub indexes 208 datasets across 45 agencies. Getting any single dataset into a queryable API is straightforward. The hard part is answering the question that compliance teams, journalists, and researchers actually ask: what is every regulatory event for this company, across all agencies, sorted by date?
Boeing has NTSB aviation accidents, FAA airworthiness directives, SEC 8-K filings, DOJ settlement agreements, OSHA violations, and OFAC screening entries — all under different identifiers in different databases. ExxonMobil has 30+ EPA enforcement cases, SEC EDGAR filings, OFAC exposure through subsidiaries, and IRS nonprofit-adjacent entities. Getting any of those in isolation is easy. Joining them in a single query is the problem this post describes solving.
The identifier problem
Federal datasets use different identifiers for the same entity:
- CIK — SEC Central Index Key (EDGAR)
- ticker — exchange symbol (SEC, CFTC, FINRA)
- UEI — Unique Entity Identifier (SAM.gov, USAspending)
- LEI — Legal Entity Identifier (CFTC, FinCEN)
- DUNS — legacy D-U-N-S Number (contracts, grants)
- NPI — National Provider Identifier (CMS, HHS)
- EIN — IRS Employer Identification Number (IRS, DOL)
- free-text name — OFAC, EPA enforcement, CFPB, DOJ press
A company can appear in each of these ID spaces under different forms of its name and with no shared identifier across datasets. Pfizer's CIK in EDGAR, its UEI in USAspending, and its NPI as a healthcare manufacturer are three different numbers with no foreign-key relationship in any government database.
The entity_master table
The entity bridge is a single SQLite table on Cloudflare D1:
CREATE TABLE entity_master ( id INTEGER PRIMARY KEY, canonical TEXT NOT NULL, -- e.g. "Pfizer Inc." slug TEXT UNIQUE, -- e.g. "PFE" cik TEXT, -- SEC EDGAR ticker TEXT, uei TEXT, -- SAM.gov / USAspending lei TEXT, -- CFTC / FinCEN duns TEXT, -- legacy npi TEXT, -- CMS ein TEXT, -- IRS aliases TEXT, -- JSON array of known name variants updated_at TEXT ); CREATE INDEX em_cik ON entity_master(cik); CREATE INDEX em_ticker ON entity_master(ticker); CREATE INDEX em_uei ON entity_master(uei); CREATE INDEX em_lei ON entity_master(lei); CREATE INDEX em_npi ON entity_master(npi);
Each row maps a single real-world entity to every identifier we can confirm for it. Thealiases column is a JSON array of every name variant we've seen across datasets — “Pfizer Inc.”, “PFIZER INC”, “Pfizer Pharmaceuticals”, “Pfizer Inc (NY)” — used for fuzzy matching on datasets that don't carry a structured identifier.
Building the mapping: three passes
Pass 1 — seed from EDGAR + SAM.gov
SEC EDGAR's company index carries CIK → ticker → name for ~10k public companies. SAM.gov's entity registration carries UEI → EIN → CAGE → name for every federal contractor and grantee. These two seed the entity_master with structured ID pairs for the highest-coverage entities: public companies and federal vendors.
# Seed: EDGAR company.idx → entity_master
for company in edgar_companies:
upsert entity_master(
cik=company.cik,
ticker=company.ticker,
canonical=company.name,
aliases=[company.name, company.former_names...]
)
# Augment: SAM.gov registration → match on EIN
for vendor in sam_entities:
match = find_by_ein(vendor.ein)
if match:
update entity_master(uei=vendor.uei) where id=match.id
else:
insert entity_master(uei=vendor.uei, canonical=vendor.name, ...)Pass 2 — LEI bridge via GLEIF
GLEIF (Global Legal Entity Identifier Foundation) publishes a public mapping of LEI → legal name → registered country. For entities in our master that have an EIN or CIK, we fuzzy-match against the GLEIF corpus to acquire the LEI. LEI unlocks CFTC COT data, FinCEN FBAR enforcement, and cross-border screening list entries.
Pass 3 — name normalization for free-text datasets
Datasets like OFAC, EPA enforcement, CFPB, and DOJ press releases don't carry structured identifiers. An EPA enforcement case lists “ExxonMobil Pipeline Company, a subsidiary of ExxonMobil Corporation”. The matching problem is entity disambiguation at scale.
We run a two-stage match: (1) exact normalized-name match against the aliasesarray, (2) token-sorted cosine similarity against the canonical name using TF-IDF vectors built over the entity_master corpus. Matches below 0.85 similarity go to a review queue rather than automatic linkage.
The cross-agency query
Once entity_master is populated, a cross-agency query is just a parameterized lookup:
# Resolve input to entity_master.id
entity = resolve("ExxonMobil") # ticker, CIK, UEI, or name
# Fan out to every dataset
events = []
events += query(sec_filings, entity.cik)
events += query(sec_8k, entity.cik)
events += query(epa_enforcement, entity.slug, fuzzy=True)
events += query(ofac_recent, entity.aliases)
events += query(doj_press, entity.aliases)
events += query(cfpb_enforcement, entity.slug)
events += query(msha_violations, entity.slug)
events += query(usaspending_contracts, entity.uei)
# Sort by date, return unified timeline
return sorted(events, key=lambda e: e.date, reverse=True)In practice this runs as parallel D1 queries via the Cloudflare Workers fetch API, with results merged server-side. Median latency for a cross-agency entity query is under 200ms on the edge — D1 query time dominates, not network.
The entity endpoint
The result is available at:
curl https://api.ai-analytics.org/entity/XOM curl https://api.ai-analytics.org/entity/PFE/material-events?days=90 curl https://api.ai-analytics.org/entity/WELLS-FARGO
Each response includes a unified timeline of regulatory events sorted by date, with each event tagged to its source dataset and carrying the original government URL in its _source envelope. The ExxonMobil endpoint currently returns 30+ EPA enforcement cases, 12 OFAC screening hits (via subsidiaries), 400+ EDGAR filings, and 5 DOJ-adjacent cases.
What breaks
A few categories of entity are hard to resolve well:
- Subsidiaries. OFAC often designates a subsidiary, not the parent. EPA enforcement cases frequently name the operating entity rather than the publicly-traded parent. We maintain a subsidiary→parent mapping for ~3k major corporate hierarchies but coverage is incomplete.
- Name collisions. “National Energy Corporation” exists in 40+ states. Without a disambiguating identifier, cross-dataset joins produce false positives. We flag low-confidence matches in the response.
- Individuals. OFAC designates individuals. HHS-OIG excludes individual providers (by NPI). SEC enforcement actions name executives. The entity bridge currently treats individuals and organizations in the same table — this works but creates some noise in common-name disambiguation.
- Government entities. City of Los Angeles, LAUSD, and various public hospitals appear in SAM.gov (as grantees), CMS (as providers), and HHS-OIG (as excluded entities). Their UEIs resolve correctly; their NPI→UEI bridge is incomplete.
What this enables
The cross-agency join is what separates a dataset catalog from a regulatory intelligence platform. Some of the queries it now makes possible in one GET:
- Every regulatory event for a company in a due-diligence workflow, without manually querying SEC, EPA, DOJ, and OFAC separately.
- Risk scoring: is this vendor on any of the 30+ screening lists, and how recently were its enforcement cases resolved?
- M&A diligence: what regulatory tail does the acquisition target carry across agencies?
- Investigative journalism: who in the OFAC SDN list also appears in the FDIC enforcement database, and what are the dollar amounts?
The entity bridge is queryable directly at api.ai-analytics.org/entity/ and via the MCP server (38+ tools). The screening endpoint wraps it with a 0–100 risk score across all enforcement lists.
Previous in this series: Building the Federal Regulatory Data Hub on Cloudflare D1: 50M+ records at the edge →
For how the entity identifiers in this bridge are normalized across CIK, UEI, LEI, DUNS, and NPI: Entity ID normalization in the Federal Regulatory Data Hub →
For how the entity bridge resolves entity names when no shared identifier exists — alias explosion, Jaro-Winkler and TF-IDF fuzzy matching, and sanctions evasion detection: Name matching in federal regulatory data: aliases, subsidiaries, and sanctions evasion →
Next in this series: Compliance screening across 30+ federal enforcement lists: how the risk score works →
For near-real-time OFAC sanctions, SAM debarment, and enforcement action webhooks delivered via the entity bridge: Federal Regulatory Data Hub change alerts →
For the entity subscription layer built on top of this bridge — EntitySubscription model, cross-list fan-out, severity scoring, and portfolio-level change monitoring: Entity subscriptions in the Federal Regulatory Data Hub: per-entity change monitoring across 30+ enforcement lists →
For how stable canonical IDs are generated and maintained across all 208 datasets — SHA-256 ID generation, merge/split events, and subscriber continuity guarantees: Canonical entity IDs in the Federal Regulatory Data Hub: stable identifiers across 208 federal datasets →