Technical writing

Building the Federal Regulatory Data Hub on Cloudflare D1: 50M+ Records at the Edge

March 10, 2025· 12 min read· AI Analytics

Regulatory dataCloudflare D1InfrastructureSQLite

We publish 208 federal datasets — SEC, FDA, OFAC, EPA, CFPB, DOJ, IRS, NHTSA, FAA, CFTC, FEMA, MSHA, OSHA, CMS, CDC, and more — via a REST API, an MCP server, and a sitemap index. The database behind all of it is Cloudflare D1: managed SQLite running at Cloudflare's edge. This post describes why we chose D1, how the schema is structured across 50M+ canonical records, how ingest works, and what breaks at this scale.

Why Cloudflare D1

The access pattern is read-heavy, query-simple, and globally distributed: a journalist in Nairobi, a compliance team in London, and a Claude agent running in a US datacenter all need the same query — “what enforcement actions does the EPA have against Exxon?” — to return in under 200ms. The options we evaluated:

Postgres on Fly.io or Railway: Good query expressiveness, single-region latency. Global edge routing would require a read-replica topology we'd have to manage.
PlanetScale / Neon: Both mature, but connection overhead from Cloudflare Workers requires the HTTP-based adapters, adding latency.
Turso (libSQL): Closest alternative — embedded SQLite with edge replication. We evaluated this seriously; D1 won on the operational simplicity of having one fewer external vendor and direct Cloudflare Workers binding without HTTP overhead.
Cloudflare D1: SQLite semantics, available as a direct Workers binding (no HTTP round-trip), wrangler d1 execute for schema migrations, and Cloudflare-managed replication. Latency from Workers: 1–5ms per query.

The tradeoff: D1 has a 10GB per-database hard limit (as of this writing), no full-text search (FTS5 is available but requires explicit virtual table creation), and no stored procedures. For our access patterns — point lookups, indexed range scans, and occasional aggregates — these constraints are acceptable.

Schema design: per-vertical tables

The 208 datasets span radically different record shapes: a NHTSA recall has a component-defect description and a recall date; an OFAC SDN entry has aliases, nationalities, and program codes; a CMS Medicare provider has an NPI, a specialty code, and a county. A single normalized “everything” table doesn't work at this diversity.

The schema is per-vertical: one SQLite table per dataset, with a shared canonical URL column that links back to the source record on the government site. Common pattern:

-- Example: OFAC SDN list
CREATE TABLE ofac_sdn (
  id           INTEGER PRIMARY KEY,
  uid          TEXT UNIQUE NOT NULL,    -- OFAC UID
  name         TEXT NOT NULL,
  type         TEXT,                    -- Individual / Entity / Vessel
  programs     TEXT,                    -- JSON array ["RUSSIA","SDN"]
  aliases      TEXT,                    -- JSON array
  nationalities TEXT,                   -- JSON array
  dob          TEXT,
  source_url   TEXT,
  updated_at   TEXT
);
CREATE INDEX ofac_sdn_name ON ofac_sdn(name);

-- Example: EPA enforcement cases
CREATE TABLE epa_enforcement (
  id             INTEGER PRIMARY KEY,
  case_number    TEXT UNIQUE,
  case_name      TEXT,
  respondent     TEXT,
  statute        TEXT,                  -- RCRA, CAA, CWA, ...
  penalty_amount INTEGER,
  settlement_date TEXT,
  case_status    TEXT,
  facility_state TEXT,
  source_url     TEXT,
  updated_at     TEXT
);
CREATE INDEX epa_enforcement_respondent ON epa_enforcement(respondent);
CREATE INDEX epa_enforcement_statute    ON epa_enforcement(statute);

This means queries can use tight, dataset-specific indexes without the overhead of a polymorphic schema. The cross-agency join layer lives above D1, in the Workers fetch logic, not in the database — described in more detail in the entity graph post.

The canonical record URL

Every record gets a canonical URL at https://api.ai-analytics.org/<dataset>/<record-id>. That URL resolves to the record in three content types via Accept header or extension:

# Browser (HTML + Schema.org JSON-LD in <script> tag)
curl https://api.ai-analytics.org/ofac-sdn/12345

# JSON (AI agents, jq)
curl -H 'Accept: application/json' https://api.ai-analytics.org/ofac-sdn/12345
curl https://api.ai-analytics.org/ofac-sdn/12345.json

# Markdown (LLM-friendly prose summary)
curl -H 'Accept: text/markdown' https://api.ai-analytics.org/ofac-sdn/12345
curl https://api.ai-analytics.org/ofac-sdn/12345.md

The canonical URL is what goes into the sitemap index — 35M+ entries, sharded into 5,000-URL sitemaps. Search engines and AI crawlers can discover every record. The_source envelope in every JSON response carries the original government URL, retrieval date, and license so downstream users can always trace provenance.

Ingest pipeline

Each dataset has a dedicated ingest script that runs on a Cloudflare Cron trigger (daily, offset per dataset to spread load). The ingest loop:

# Pseudocode — one dataset ingest cycle
async function ingest(dataset: DatasetConfig) {
  // 1. Fetch from primary government source
  const raw = await fetch(dataset.sourceUrl, {
    headers: { 'User-Agent': 'AI-Analytics-DataBot/1.0 (ai-analytics.org)' }
  });

  // 2. Parse (CSV / JSON / XML depending on source)
  const records = dataset.parser(await raw.text());

  // 3. Upsert into D1 — upsert not insert to preserve existing fields
  await db.prepare(
    `INSERT OR REPLACE INTO ${dataset.table} (...) VALUES (...)`
  ).bind(...recordValues).run();

  // 4. Update dataset_meta row (record count, last_updated)
  await db.prepare(
    `UPDATE dataset_meta SET record_count=?, updated_at=? WHERE id=?`
  ).bind(records.length, new Date().toISOString(), dataset.id).run();
}

Schema migrations

Government datasets change their schemas. The FDA added a new field to the warning letters export; the OFAC SDN XML schema version bumped; the NHTSA recall CSV added a component-type column. Migrations run via wrangler d1 execute against the production D1 database:

# Add a new column to an existing table
wrangler d1 execute ai-analytics-regulatory \
  --command "ALTER TABLE nhtsa_recalls ADD COLUMN component_type TEXT"

Rollbacks are manual; D1 doesn't have built-in migration tracking. We maintain a numbered migrations/ directory and a schema_migrations table to track applied migrations, similar to how Flyway or Goose work.

Government source reliability

Federal data sources are less reliable than their DNS records imply. Common failure modes we've hit in production:

Rate limiting. data.gov and some agency APIs rate-limit by IP. We rotate between Cloudflare Worker IPs and add exponential backoff.
Schema changes without notice. The CFTC COT report CSV has changed column ordering twice. We validate column count before upsert and alert on mismatch.
Weekend / holiday outages. Several agency APIs go fully offline on federal holidays (returns 503). The ingest script treats 5xx as a soft failure and retries the next day rather than overwriting with empty data.
FTP-only sources. Some legacy datasets (OSHA citations, certain MSHA exports) are still FTP. We run a fetch agent that mirrors the FTP tree to R2 storage and ingests from R2.
ZIP-inside-ZIP archives. The SAM.gov exclusion full extract is a ZIP containing multiple ZIP files, each containing CSV shards. The parser handles nested archive traversal.

D1 at 50M+ records: what we learned

Write bandwidth is the constraint, not read latency

D1 has a 1,000 write operations per second limit per database (in the current tier). With 50M+ records across 197 tables refreshed daily, the ingest pipeline processes roughly 400 records per second at peak — within the limit, but only because we stagger the ingest crons across a 6-hour window. Bulk ingest of new datasets (initial load of 1M+ records) requires rate-limiting the upsert loop explicitly.

D1's query planner is SQLite's query planner

Because D1 is SQLite under the hood, the query planner behaves identically to SQLite. This is mostly good — SQLite's planner is well-understood. One gotcha: SQLite prefers full-table scans over index scans for small tables. Datasets with fewer than ~50k rows often get full scanned even with an index. At 35M+ total records spread across 197 tables, this is only a problem for the smallest staging datasets.

The 10GB limit is real

We started with a single D1 database and hit 8.4GB with ~160 datasets. The solution was vertical sharding: one D1 database per major agency cluster (SEC/EDGAR, FDA, EPA/enforcement, OFAC/sanctions, healthcare). The entity_master bridge lives in its own D1 database. The Workers layer routes queries to the right D1 binding based on the dataset prefix. Cloudflare allows up to 50,000 D1 databases per account, so this approach has significant headroom.

Full-text search via FTS5

For datasets where free-text search matters (DOJ press releases, FDA warning letters, OFAC aliases), we create FTS5 virtual tables alongside the primary table:

-- FTS5 virtual table for DOJ press releases
CREATE VIRTUAL TABLE doj_press_fts USING fts5(
  title, body, district,
  content='doj_press', content_rowid='id',
  tokenize='porter ascii'
);

-- Keep in sync via triggers
CREATE TRIGGER doj_press_fts_insert AFTER INSERT ON doj_press BEGIN
  INSERT INTO doj_press_fts(rowid, title, body, district)
  VALUES (new.id, new.title, new.body, new.district);
END;

FTS5 with Porter stemming handles the most common free-text queries well. It doesn't support vector similarity search — we run that separately via a FAISS index on the ingest worker when relevant.

Serving the API: Cloudflare Workers + Next.js

The API runs as a Cloudflare Worker with D1 bindings. The frontend (ai-analytics.org) is Next.js 14 static export on Cloudflare Pages, fetching from the API at build time (for static data) or at runtime via theLiveRegulatoryWidget client component (for live counters).

The MCP server (https://api.ai-analytics.org/mcp) is a second Cloudflare Worker that wraps the same D1 bindings and exposes 38+ MCP tools over JSON-RPC / Streamable HTTP. Claude, GPT, and other agent frameworks can query any of the 208 datasets directly without writing a custom API integration.

What comes next

The D1 foundation is stable. The remaining work is higher in the stack:

Completing the entity bridge to make cross-agency joins automatic rather than dataset-specific.
Vector-similarity search across record text fields (D1 + external FAISS, or waiting for D1-native vector support).
Pushing the remaining 15+ staged datasets into production D1 (currently in the frontend catalog but not live in the API).
Streaming ingest for near-real-time datasets (EDGAR EDGAR-Online filings, CISA KEV additions) that currently batch-refresh daily.

For the entity bridge that makes cross-agency joins automatic: Building the cross-agency regulatory entity graph: 50M+ records, one join →

For how the five federal identifier types (CIK, UEI, LEI, DUNS, NPI) are normalized into the entity_master bridge: Entity ID normalization in the Federal Regulatory Data Hub →

For the REST, MCP, and JSON-LD API layer on top of this D1 database: The Federal Regulatory API: REST, MCP, and JSON-LD for 208 federal datasets →

For the compliance risk score that queries all 30+ enforcement lists via these tables: Compliance screening across 30+ federal enforcement lists: how the risk score works →

For how the Worker query layer routes requests across these 8 D1 shards at the edge: The Federal Regulatory Data Hub query layer: routing 50M+ records at the Cloudflare edge →

For Voidly's analogous storage tiering — hot/warm/cold retention policy for 2.2B measurement rows in TimescaleDB: Voidly's measurement retention policy: hot, warm, and cold tiers for 2.2B probe results →