Technical writing

Canonical entity IDs in the Federal Regulatory Data Hub: stable identifiers across 208 federal datasets

· AI Analytics
RegulatoryInfrastructureData Engineering

The Federal Regulatory Data Hub indexes 208 federal datasets across 45 agencies. Each dataset assigns its own identifier to the companies and individuals it tracks. OFAC's SDN list uses a numeric SDN entry number. SAM.gov issues 12-character alphanumeric UEIs. SEC EDGAR assigns CIK numbers. FDA uses Facility Establishment Identifiers (FEI). EPA assigns FRS IDs through its Facility Registry Service. None of these systems share a foreign key with any other. The same legal entity can appear in all five databases under five completely different identifiers and a dozen name variants — with no shared key to link them.

The canonical_id layer is the Hub's answer to this fragmentation. Every entity in the system — organization, individual, vessel, or aircraft — receives exactly one canonical_id: a stable, deterministic identifier that persists across source changes, name corrections, and dataset expansions. When a company gets a new UEI after a SAM.gov re-registration, the canonical_id does not change. When an OFAC listing is amended with a new alias, the canonical_id does not change. The canonical_id is the stable anchor that lets subscribers and API consumers build durable integrations without coupling their code to the volatility of any individual source system's identifiers.

The identifier fragmentation problem

Consider Huawei Technologies. In OFAC's SDN list, the company appears as “HUAWEI TECHNOLOGIES CO., LTD.” — all caps, with a period after “CO” and after “LTD”. In SAM.gov, the same company is registered as “Huawei Technologies Co. Ltd.” — mixed case, with a period after “Co” but not after “Ltd”. SEC EDGAR knows the company as CIK 0001645590. FDA's FEI registry holds FEI 3003374098 for its US-registered facilities. EPA's FRS assigns ID 110007044932.

That is five datasets, five identifier schemes, and two name variants for a single legal entity — before accounting for subsidiaries, joint ventures, or the dozens of additional name variants that appear in enforcement actions and export control records. No government system links any of these together. A query for “show me all federal regulatory events involving Huawei” requires knowing which identifier format each of the 208 datasets expects, issuing 197 separate parameterized queries, and then deduplicating the results by hand.

The canonical_id layer collapses this into a single stable handle: em_7f3a9c12b4d8e91a. Query that ID against any of the 208 datasets and you get back all events across all agencies — no per-dataset ID juggling required.

The EntityMasterRecord interface

Each entity in the Hub is represented by an EntityMasterRecord. Thecanonical_id is the primary key; the source_ids map holds every known source-specific identifier for that entity, keyed by dataset namespace:

interface EntityMasterRecord {
  canonical_id: string;        // 'em_7f3a9c12b4d8e91a'
  display_name: string;
  entity_type: 'ORGANIZATION' | 'INDIVIDUAL' | 'VESSEL' | 'AIRCRAFT';
  source_ids: Record<string, string[]>;
  // example: {
  //   'ofac_sdn':  ['SDN-12345'],
  //   'sam_uei':   ['ABCDEF123456'],
  //   'sec_cik':   ['0001645590'],
  //   'fda_fei':   ['3003374098'],
  //   'epa_frs':   ['110007044932']
  // }
  confidence_score: number;   // 0.0-1.0 composite matching confidence
  created_at: string;
  version: number;
  is_active: boolean;
}

The source_ids map uses string arrays rather than single strings because a single entity can hold multiple IDs within the same source system. A company acquired through a merger may retain both its own OFAC SDN entry and the acquired entity's former SDN entry — both belong insource_ids['ofac_sdn']. Healthcare organizations routinely have dozens of NPI entries, one per facility. The array-of-strings schema handles these cases without schema evolution.

The confidence_score is a composite value across all source linkages used to build the record. A record where all source_ids were linked via authoritative cross-references (matching EINs across SAM.gov and EDGAR, for instance) carries a score near 1.0. A record where the OFAC and FDA linkage was established through fuzzy name matching carries a lower score that reflects the uncertainty in that specific link. Callers can filter on confidence_score to apply their own threshold — compliance workflows typically require 0.95 or above.

Canonical ID generation

The canonical_id is deterministic: the same entity, re-ingested from scratch on any future date, produces the same ID. This determinism is load-bearing — it means the system can be rebuilt from source data without invalidating any subscriber's stored references or requiring a migration to update stored IDs.

For organizations: canonical_id = 'em_' + the first 16 hex characters of SHA-256(entity_type + '|' + primary_normalized_name + '|' + earliest_source_date). For individuals: SHA-256(entity_type + '|' + last_name_normalized + '|' + first_name_normalized + '|' + dob_or_empty). The normalization applied to names before hashing is the same normalization applied throughout the pipeline — stripping punctuation, uppercasing, expanding abbreviations, removing legal suffixes — so “Huawei Technologies Co., Ltd.” and “HUAWEI TECHNOLOGIES CO LTD” produce the same hash input and therefore the same canonical_id.

The Cloudflare Worker implementation uses the SubtleCrypto API, which is available in the Workers runtime without any additional imports:

async function generateCanonicalId(
  entityType: 'ORGANIZATION' | 'INDIVIDUAL' | 'VESSEL' | 'AIRCRAFT',
  params: {
    primaryNormalizedName?: string;
    earliestSourceDate?: string;
    lastNameNormalized?: string;
    firstNameNormalized?: string;
    dobOrEmpty?: string;
  }
): Promise<string> {
  let hashInput: string;

  if (entityType === 'INDIVIDUAL') {
    hashInput = [
      entityType,
      params.lastNameNormalized ?? '',
      params.firstNameNormalized ?? '',
      params.dobOrEmpty ?? '',
    ].join('|');
  } else {
    hashInput = [
      entityType,
      params.primaryNormalizedName ?? '',
      params.earliestSourceDate ?? '',
    ].join('|');
  }

  const encoder = new TextEncoder();
  const data = encoder.encode(hashInput);
  const hashBuffer = await crypto.subtle.digest('SHA-256', data);
  const hashArray = Array.from(new Uint8Array(hashBuffer));
  const hashHex = hashArray.map(b => b.toString(16).padStart(2, '0')).join('');

  return 'em_' + hashHex.slice(0, 16);
}

// Example — Huawei Technologies:
// entityType: 'ORGANIZATION'
// primaryNormalizedName: 'HUAWEI TECHNOLOGIES'  (after normalization)
// earliestSourceDate: '2019-05-16'              (first OFAC listing date)
// => canonical_id: 'em_7f3a9c12b4d8e91a'

The 16-hex-character suffix gives 64 bits of address space — approximately 1.8 × 1019possible IDs. At the current corpus size of roughly 2.1 million entities, the birthday-paradox collision probability is negligible (approximately 1.2 × 10-10). The system does check for collisions at insert time and raises an error if a newly computed ID collides with an existing record whose hash inputs differ — in practice this has not occurred in production.

EntityVersion table

The canonical_id is immutable. The record it points to is not. When source data changes — a new alias is added to an OFAC entry, a name spelling is corrected in SAM.gov, a new dataset links to the entity — a new version of the record is created rather than mutating the existing row. The current version is always the row where effective_to IS NULL.

CREATE TABLE entity_canonical_versions (
  canonical_id  TEXT NOT NULL,
  version       INTEGER NOT NULL,
  display_name  TEXT NOT NULL,
  source_ids    TEXT NOT NULL,  -- JSON
  confidence    REAL NOT NULL,
  effective_from TEXT NOT NULL, -- ISO8601
  effective_to  TEXT,           -- NULL = current
  change_reason TEXT NOT NULL,  -- 'ALIAS_ADDED', 'SOURCE_ADDED', 'MERGE', 'SPLIT', 'CORRECTION'
  PRIMARY KEY (canonical_id, version)
);

The change_reason vocabulary is intentionally narrow. ALIAS_ADDED covers the common case of a source dataset publishing a new name variant for an existing entity.SOURCE_ADDED records when a new dataset first links to a canonical_id that was previously only known from other sources. CORRECTION covers name spelling fixes and data quality updates that do not reflect a real-world change in the entity. MERGE and SPLITare covered in detail below — they are structurally different events that carry their own propagation mechanics.

Version history is retained indefinitely. Subscribers can query any past version of an entity record by specifying a version or an as_of timestamp — useful for compliance audits that need to reconstruct exactly what the system knew about an entity at a given point in time.

Merge events

Approximately 0.3% of ingestion batches trigger a merge event. A merge occurs when evidence accumulates that two existing canonical_ids refer to the same real-world entity. The typical trigger is cross-source corroboration: an entity appears in OFAC under one normalized name with confidence 0.87, and a new dataset arrives that links the same entity via an authoritative shared identifier (EIN, LEI, or CAGE code), pushing composite confidence high enough to confirm the match.

The merge protocol: the lower-confidence canonical_id is retired and becomes an alias of the higher-confidence one. A new version of the surviving record is created with change_reason = 'MERGE', incorporating all source_ids from both predecessors. The retired canonical_id is never deleted — it continues to exist in entity_canonical_versions with a final version row marking it as merged.

Consumers querying the retired canonical_id receive a 301-style redirect response: the API returns the record for the surviving canonical_id along with a merged_into field containing the new canonical_id and the merge date. The entity_id_mapping table (see below) is updated so all source IDs that previously mapped to the retired canonical_id now route to the surviving one. All active subscriptions on the retired canonical_id are migrated to the surviving canonical_id, and a webhook fires to all affected subscribers with the full merge event payload — both old IDs, the new canonical_id, the effective date, and the confidence that triggered the merge.

The merge propagation is transactional: the version row creation, the entity_id_mapping updates, the subscription migration, and the webhook enqueue all happen within a single D1 write batch. If any step fails, the entire merge is rolled back and retried on the next ingestion cycle.

Split events

Split events are rare — less than 0.01% of records — and represent the inverse problem: a canonical_id that was incorrectly assigned to two distinct real-world entities. Splits typically surface through human review of anomalous data patterns: an entity record whose source_ids span two incompatible geographies, or whose aliases include names that belong to two separate companies that happen to share a corporate name prefix.

The split protocol: the original canonical_id is retired with change_reason = 'SPLIT'. Two new canonical_ids are generated from scratch — each is computed deterministically from its own set of hash inputs, as described above. All source_ids from the original are partitioned between the two new IDs based on which real-world entity each source record actually refers to. The entity_id_mapping table is updated accordingly, and a SPLIT webhook fires to all affected subscribers carrying both new canonical_ids. Subscribers that had built integrations on the original canonical_id must update their references — unlike merges (which preserve backward-compatibility via redirect), splits require active subscriber attention because the system cannot automatically determine which of the two new IDs a subscriber's downstream records should reference.

EntityAlias table

Every known name variant for an entity — AKAs from OFAC, former names from corporate filings, phonetic transliterations from international watchlists, NFE (name formerly entered) corrections — is recorded in entity_aliases with an effective date range and a language code:

CREATE TABLE entity_aliases (
  canonical_id TEXT NOT NULL,
  alias_name   TEXT NOT NULL,
  alias_type   TEXT NOT NULL,  -- 'AKA', 'FKA', 'NFE', 'PHONETIC'
  source       TEXT NOT NULL,
  effective_from TEXT NOT NULL,
  effective_to TEXT,
  language_code TEXT DEFAULT 'en',
  FOREIGN KEY (canonical_id) REFERENCES entity_canonical_versions(canonical_id)
);

The alias_type vocabulary maps directly to OFAC's SDN notation.AKA (also known as) is the most common — used for trade names, abbreviated names, and concurrent alternative spellings. FKA (formerly known as) tracks names that were retired after a merger or rebrand. NFE (name formerly entered) records corrections to names that were previously entered with errors — distinct from FKA in that the prior form was a mistake rather than a legitimate historical name. PHONETIC captures transliterations of names from non-Latin scripts, where the romanized form varies across datasets and jurisdictions.

The alias table is the primary index for name-based entity lookup. A full-text search for “Huawei” hits the alias index before touching the canonical versions table — the alias index carries every normalized name variant, so a single B-tree scan returns all matching canonical_ids regardless of which source dataset originally published each variant.

Subscriber continuity guarantees

The central promise of the canonical_id layer is that a subscriber's stored reference remains valid across source identifier churn. When SAM.gov migrated from DUNS to UEI in 2022, every government contractor received a new primary identifier. For Hub subscribers, that event was invisible — the canonical_id for each contractor did not change, and the entity_id_mapping table was updated to map the new UEI to the same canonical_id that previously mapped from the DUNS number.

The lookup table that enables this is entity_id_mapping: a flat, indexed table that maps any source-specific identifier to the current canonical_id, regardless of how many times the source identifier has changed:

CREATE TABLE entity_id_mapping (
  source_system TEXT NOT NULL,
  source_id     TEXT NOT NULL,
  canonical_id  TEXT NOT NULL,
  mapped_at     TEXT NOT NULL,
  is_primary    INTEGER DEFAULT 0,
  PRIMARY KEY (source_system, source_id)
);

The is_primary flag marks the current authoritative identifier for a given source system. When a company gets a new UEI, the old DUNS-to-canonical_id mapping row is retained (for backward compatibility with historical data references) and a new UEI row is inserted withis_primary = 1. The old DUNS row is updated to is_primary = 0. Both rows continue to resolve to the same canonical_id — the distinction between primary and non-primary is only relevant for callers that need to know which identifier to use when writing back to the source system.

Lookup performance

The entity_id_mapping table is indexed by its primary key (source_system, source_id) — a composite B-tree index that resolves any source-system lookup in a single indexed seek. At the current corpus size of approximately 2.1 million entities across 208 datasets (averaging roughly 10,000 entities per dataset), the table holds approximately 2.1 million rows and fits well within D1's 10 GB per-database limit. The table is read-heavy and write-rare — writes occur only during ingestion batches, while reads occur on every entity resolution request.

Lookup pathp50 latencyMechanism
canonical_id direct0.3 msPrimary key lookup on entity_canonical_versions
source_id lookup1.8 msIndexed seek on entity_id_mapping, then canonical lookup
fuzzy name search12 msFTS5 scan over entity_aliases, then canonical lookup

The source_id lookup path adds approximately 1.5 ms over direct canonical_id access — one additional indexed seek against entity_id_mapping. Fuzzy name search is significantly more expensive because it invokes the FTS5 full-text index across all alias rows, but 12 ms is still fast enough for interactive API use. All three paths are measured at the D1 query layer; network round-trip to the Cloudflare edge adds 1–8 ms depending on the caller's geographic proximity to the nearest data center.

A real-world example: the Huawei entity lifecycle

The Huawei canonical_id illustrates the full lifecycle from initial ingestion through multi-source enrichment. The timeline:

Step 1 — Initial ingestion from OFAC SDN (2019-05-16). The OFAC SDN list is ingested and the parser encounters entry SDN-12345: “HUAWEI TECHNOLOGIES CO., LTD.”, entity type ORGANIZATION, country CN. The name normalization function strips punctuation and legal suffixes, producing the normalized form “HUAWEI TECHNOLOGIES”. No existing canonical_id matches this normalized name. A new canonical_id is generated: SHA-256('ORGANIZATION|HUAWEI TECHNOLOGIES|2019-05-16') — first 16 hex chars — prefix 'em_', yielding em_7f3a9c12b4d8e91a. The entity_id_mapping row(ofac_sdn, SDN-12345, em_7f3a9c12b4d8e91a) is inserted. Confidence score: 0.91 (single source, no cross-reference yet).

Step 2 — Discovery in SAM.gov (2020-03-01). The SAM.gov entity registration extract is ingested. A record for “Huawei Technologies Co. Ltd.” appears — mixed case, slightly different punctuation. Name normalization again produces “HUAWEI TECHNOLOGIES”. The alias table lookup finds a match against the existingem_7f3a9c12b4d8e91a record. Match confidence: 0.94 (normalized name is identical; country code CN matches). No merge is needed — this is a SOURCE_ADDED event. A new version of the canonical record is created, source_ids['sam_uei'] is populated, and the entity_id_mapping row (sam_uei, ABCDEF123456, em_7f3a9c12b4d8e91a) is inserted. Composite confidence rises to 0.97.

Step 3 — SEC EDGAR CIK linked (2020-09-15). EDGAR's company index is processed. CIK 0001645590, registrant name “Huawei Technologies Co. Ltd.”, is encountered. The EIN in the SAM.gov record matches the EIN in the EDGAR filing — this is an authoritative Pass 1 cross-reference. source_ids['sec_cik'] is populated. Confidence rises to 0.99.

Step 4 — FDA FEI and EPA FRS added (2021-02-10). FDA facility registration data and EPA FRS data are ingested. Both datasets carry the same normalized company name and the same country code. Both are linked via alias table lookup to em_7f3a9c12b4d8e91a. After both source_ids entries are added, the canonical record carries five source_id entries across five independent federal datasets — all linked to a single stable identifier that has not changed since the initial OFAC ingestion in 2019.

At this point, querying em_7f3a9c12b4d8e91a against the entity_id_mapping table returns five rows: one OFAC SDN entry, one SAM.gov UEI, one SEC CIK, one FDA FEI, and one EPA FRS ID — all resolved from a single canonical_id lookup in under 2 ms.


For the cross-agency entity bridge that joins records from all 208 datasets: Building the cross-agency regulatory entity graph: 50M+ records, one join →

For how entity names are normalized and fuzzy-matched across datasets: Name matching in federal regulatory data: aliases, subsidiaries, and sanctions evasion →

For how identifier normalization (CIK, UEI, LEI, DUNS, NPI) precedes canonical ID assignment: Entity ID normalization in the Federal Regulatory Data Hub: resolving CIK, UEI, LEI, DUNS, and NPI across 208 datasets →

For how canonical IDs are used in the compliance risk score: Compliance screening across 30+ federal enforcement lists: how the risk score works →