Technical writing

Full-text search across federal regulatory narratives: SQLite FTS5 on Cloudflare D1

March 17, 2025· 10 min read· AI Analytics

Regulatory dataCloudflare D1InfrastructureSQLite

The Federal Regulatory Data Hub's storage layer runs on Cloudflare D1 — vertically sharded SQLite at the edge, described in the D1 architecture post. That architecture handles structured data well: indexed lookups by CIK, UEI, penalty amount, case date. But a significant portion of regulatory data is structurally narrative, and narrative data requires a different search approach. This post describes how we implemented full-text search across those datasets using SQLite's FTS5 extension running on Cloudflare D1.

Why free-text search is unavoidable in regulatory data

Several of the most important regulatory datasets are written primarily in prose. DOJ press releases describe enforcement actions — the charging theory, the statute, the individual defendants, the settlement terms — in narrative paragraphs rather than structured fields. FDA warning letters are regulatory correspondence: they name the entity under review, describe the violations found during inspection, and cite the specific CFR provisions, but the operative content is in flowing prose. CFPB consumer complaint narratives average roughly 180 words per complaint and are almost entirely unstructured. SEC litigation releases describe fraud schemes, name co-conspirators, and enumerate charges in paragraph form.

The entity name problem makes free-text search especially necessary. A company like JPMorgan Chase appears across these narrative datasets in at least half a dozen surface forms: “JPMorgan Chase”, “J.P. Morgan”, “JPMorgan”, “J.P. Morgan Chase & Co.”, “JPMorgan Chase Bank, N.A.”, and “Chase Bank”. A DOJ press release from 2014 may refer to the entity by a name that does not appear anywhere in the SEC EDGAR filing for the same enforcement event. You cannot reliably reconstruct a company's enforcement history from narrative datasets without full-text search and alias expansion — an exact-match filter on a name column will miss the majority of relevant records.

SQLite FTS5 with unicode61

SQLite's FTS5 extension provides an inverted index full-text search engine built directly into the database engine. FTS5 creates a virtual table that stores an inverted index alongside the base data; queries against the virtual table return ranked results from the inverted index without touching the base table for the initial filter. Cloudflare D1 includes FTS5 — it is compiled in by default.

The tokenizer choice matters. The default ascii tokenizer handles only US-ASCII text. Federal regulatory data contains non-ASCII entity names: French, German, Chinese, and Arabic company names appear in OFAC SDN records and FinCEN enforcement actions. The unicode61 tokenizer, included since SQLite 3.26.0 and available in D1, handles Unicode normalization and case folding correctly. It folds accented characters to their base forms, handles Unicode word boundary detection, and applies case-insensitive matching across the full Unicode range. For regulatory data, unicode61 is the correct default.

The content and content_rowid arguments create a “content table” — an FTS5 configuration where the virtual table holds only the inverted index, not a copy of the text. The base table remains the source of truth; FTS5 stores only the index. This halves the storage overhead compared to a standalone FTS5 table, at the cost of requiring triggers to keep the index synchronized:

-- FTS5 virtual table for DOJ enforcement press releases
-- unicode61 tokenizer for correct handling of non-ASCII entity names
CREATE VIRTUAL TABLE doj_enforcement_fts USING fts5(
  title,
  enforcement_summary,
  body,
  district,
  content='doj_enforcement',    -- base table name
  content_rowid='id',           -- rowid column in the base table
  tokenize='unicode61 remove_diacritics 1'
);

-- FTS5 virtual table for FDA warning letters
CREATE VIRTUAL TABLE fda_warning_letters_fts USING fts5(
  subject,
  issuing_office,
  violations_summary,
  full_text,
  content='fda_warning_letters',
  content_rowid='id',
  tokenize='unicode61 remove_diacritics 1'
);

The remove_diacritics 1 option instructs unicode61 to strip diacritical marks when tokenizing — “Société Générale” and “Societe Generale” both tokenize to the same index terms. This is correct for entity name search; it would be wrong for a literary corpus where diacritics carry meaning.

BM25 scoring

FTS5 uses BM25 (Best Match 25) for relevance ranking. BM25 is a probabilistic retrieval model that scores each document based on term frequency within the document, inverse document frequency across the corpus, and a length normalization factor that prevents long documents from automatically outscoring short ones that are more densely relevant. FTS5 exposes the BM25 score through the implicit rank column on any FTS5 query.

SQLite convention: the rank value is returned as a negative number. A lower (more negative) rank value means a more relevant document. Ordering by rank ascending gives most-relevant-first:

-- Simple BM25-ranked FTS5 query
SELECT
  doj_enforcement.id,
  doj_enforcement.case_date,
  doj_enforcement.title,
  doj_enforcement.district,
  doj_enforcement_fts.rank
FROM doj_enforcement_fts
JOIN doj_enforcement ON doj_enforcement.id = doj_enforcement_fts.rowid
WHERE doj_enforcement_fts MATCH 'JPMorgan'
ORDER BY doj_enforcement_fts.rank
LIMIT 20;

The default BM25 weights all indexed columns equally. For regulatory data, the title and enforcement summary are more signal-dense than the full press release body — a match in the title almost always means the document is primarily about that entity, while a match only in the body might be a passing reference. FTS5's bm25() function accepts per-column weights to express this:

-- BM25 with per-column weights: title 10×, enforcement_summary 5×, body 1×
-- Columns are listed in the order they appear in the CREATE VIRTUAL TABLE statement
SELECT
  doj_enforcement.id,
  doj_enforcement.case_date,
  doj_enforcement.title,
  bm25(doj_enforcement_fts, 10.0, 5.0, 1.0, 1.0) AS weighted_rank
FROM doj_enforcement_fts
JOIN doj_enforcement ON doj_enforcement.id = doj_enforcement_fts.rowid
WHERE doj_enforcement_fts MATCH 'JPMorgan'
ORDER BY weighted_rank
LIMIT 20;

The four weight arguments correspond to the four indexed columns in declaration order:title (10×), enforcement_summary (5×),body (1×), and district (1×). A title match contributes ten times as much to the relevance score as a body match. In practice, this substantially reduces noise from documents that mention an entity only in passing within a long press release narrative.

The highlight() and snippet() functions

Relevance ranking tells users which documents matched; it doesn't tell them why. Users querying a company's enforcement history want to see the sentence or paragraph that contains the match — the specific charge, the dollar amount, the named defendant — not just a link to the document. FTS5 provides two functions for this: highlight() andsnippet().

highlight(table, col_index, open_tag, close_tag) returns the full text of the specified column with every matched term wrapped in the provided HTML tags.snippet(table, col_index, open_tag, close_tag, ellipsis, num_tokens) returns a short excerpt (up to num_tokens tokens) centered on the best match within the column, with matched terms similarly wrapped. For API responses, snippet is almost always the right choice — returning a 40-word excerpt is more useful than returning a 2,000-word press release with matches bolded throughout.

-- Full query: BM25-ranked DOJ enforcement results with highlighted excerpts
-- Useful for the compliance API: returns WHY each document matched, not just that it matched
SELECT
  doj_enforcement.id,
  doj_enforcement.case_date,
  doj_enforcement.district,
  doj_enforcement.source_url,
  highlight(doj_enforcement_fts, 0, '<b>', '</b>')   AS title_highlighted,
  snippet(
    doj_enforcement_fts,
    2,              -- col_index: body column (0-indexed, order from CREATE VIRTUAL TABLE)
    '<mark>',       -- open tag wrapping matched terms
    '</mark>',      -- close tag
    '...',          -- ellipsis between non-contiguous excerpts
    10              -- max tokens in the excerpt window
  ) AS body_snippet,
  bm25(doj_enforcement_fts, 10.0, 5.0, 1.0, 1.0) AS score
FROM doj_enforcement_fts
JOIN doj_enforcement ON doj_enforcement.id = doj_enforcement_fts.rowid
WHERE doj_enforcement_fts MATCH '"JPMorgan Chase"'
ORDER BY score
LIMIT 10;

The col_index in highlight() and snippet() is 0-based and refers to columns in the order they appear in the CREATE VIRTUAL TABLE declaration — not the order of the base table. Index 0 is title, index 1 is enforcement_summary, index 2 is body. The <mark> and </mark> tags in the API response are passed through to the client and rendered as highlighted spans in the compliance search UI. The API documentation explicitly instructs callers to treat these as trusted markup only from this endpoint — the tags are constructed by FTS5, not from user input.

Query construction and special characters

FTS5 has its own query syntax that is distinct from SQL. Phrase queries require double quotes:"JPMorgan Chase". Boolean OR is written explicitly:JPMorgan OR Chase. Exclusion uses a minus prefix:JPMorgan -fraud. Column-restricted queries use a prefix notation:title:JPMorgan. These are FTS5 query operators, not SQL operators, and they must be in the string passed to the MATCH operator.

Entity names in regulatory data create several construction hazards. Hyphens in names like “J.P. Morgan” can be parsed as subtraction in some tokenizer modes. Periods are stripped by unicode61, so “J.P. Morgan” tokenizes to the same terms as “JP Morgan”, but a naive query string that passes the raw name through may fail with a syntax error if the dots appear in unexpected positions. Ampersands — common in fund names like “Lazard Frères & Co.” — are not FTS5 operators but require the surrounding name to be phrase-quoted to prevent tokenization issues. Double quotes within an entity name must be escaped by doubling them.

// TypeScript — build a safe FTS5 query string from an entity name
// Normalizes the name, handles special characters, and phrase-quotes the result
function buildFts5Query(entityName: string): string {
  // Step 1: strip common legal suffixes that add noise without adding signal
  const LEGAL_SUFFIXES = /(Inc.?|Corp.?|LLC|Ltd.?|Co.?|Company|Corporation|Limited|LP|LLP|PLC|NA|FSB|Bank)/gi;
  let normalized = entityName.replace(LEGAL_SUFFIXES, '').trim();

  // Step 2: collapse punctuation that FTS5 would tokenize ambiguously
  // Dots between initials (J.P.) → retained; dots at end of word → removed
  // Hyphens in compound names → replaced with space so each part is indexed
  normalized = normalized
    .replace(/.s+/g, ' ')   // ". " → " " (sentence-ending dots)
    .replace(/&/g, 'and')      // & → and (ampersand)
    .replace(/-/g, ' ')        // - → space (hyphenated names)
    .replace(/s+/g, ' ')
    .trim();

  // Step 3: escape any double-quote characters inside the name
  // FTS5 phrase queries use double quotes; internal quotes must be doubled
  const escaped = normalized.replace(/"/g, '""');

  // Step 4: wrap in double quotes for phrase search
  // This ensures multi-word names are matched as a phrase, not as individual terms
  return `"${escaped}"`;
}

// Examples:
// buildFts5Query('J.P. Morgan Chase & Co.')   → '"JP Morgan Chase and"'
// buildFts5Query('Wells Fargo Bank, N.A.')     → '"Wells Fargo N A"'
// buildFts5Query('Lazard Frères & Co. LLC')    → '"Lazard Frères and"'
// buildFts5Query('He said "hello"')            → '"He said ""hello"""'

The normalization deliberately does not strip the dot from initial sequences like “J.P.” because unicode61's tokenizer treats the period as a word boundary and handles the tokenization correctly. Stripping it before FTS5 would cause no harm, but leaving it lets the tokenizer handle the edge case consistently. What matters is that the same normalization is applied when aliases are indexed and when queries are constructed — so that “JP Morgan” and “J.P. Morgan” produce the same FTS5 tokens on both sides of the match.

Alias expansion in FTS5 queries

The entity ID normalization pipeline — described separately in the normalization post — maintains an entity_aliases table with all known name variations for each entity in the corpus. FTS5 phrase search is exact within the tokenized form; it will not automatically find “J.P. Morgan” when you search for “JPMorgan Chase”. The solution is to expand the FTS5 query before submission by including all known aliases as additional OR terms.

The Cloudflare Worker handling an entity name search first queries the alias table for the resolved entity, then builds an expanded FTS5 query joining all aliases with OR:

// TypeScript — Cloudflare Worker alias expansion for FTS5 queries
// Runs in the Worker before any FTS5 query is submitted to D1

async function buildExpandedFts5Query(
  entityName: string,
  entityId: string | null,
  db: D1Database,
): Promise<string> {
  const baseQuery = buildFts5Query(entityName);

  // If we couldn't resolve to an entity_id, fall back to the single phrase query
  if (!entityId) return baseQuery;

  // Fetch all active aliases for this entity from the bridge database
  const aliases = await db
    .prepare(
      `SELECT alias_text
         FROM entity_aliases
        WHERE entity_id = ?
          AND (valid_to IS NULL OR valid_to > date('now'))
        ORDER BY alias_type`
    )
    .bind(entityId)
    .all<{ alias_text: string }>();

  if (!aliases.results.length) return baseQuery;

  // Build one phrase clause per alias, then join with OR
  // Example output for JPMorgan Chase entity:
  //   "JPMorgan Chase" OR "JP Morgan" OR "JPMorgan" OR "J P Morgan Chase"
  const aliasClauses = aliases.results.map(row =>
    buildFts5Query(row.alias_text)
  );

  // Deduplicate (buildFts5Query may normalize different aliases to the same phrase)
  const unique = [...new Set([baseQuery, ...aliasClauses])];

  return unique.join(' OR ');
}

// Usage in the compliance search handler:
//
// const entityMatch = await resolveEntity(rawName, bridgeDb);
// const fts5Query = await buildExpandedFts5Query(
//   rawName,
//   entityMatch?.entity_id ?? null,
//   bridgeDb,
// );
// → '"JPMorgan Chase" OR "JP Morgan" OR "JPMorgan" OR "J P Morgan Chase and Co"'

The expanded query is then passed as the MATCH argument in the FTS5 query. A single FTS5 query with an OR-expanded alias list is substantially more efficient than running separate queries per alias and merging results in application code — the FTS5 inverted index handles the OR expansion in a single pass over the index.

Cross-dataset search fan-out

A single compliance search must cover all relevant narrative datasets simultaneously: DOJ press releases, FDA warning letters, CFPB complaint narratives, SEC litigation releases, and EPA ECHO enforcement descriptions. These datasets live in separate D1 shards — the DOJ and SEC data are in the enforcement shard, FDA in the healthcare shard, CFPB and EPA in their respective verticals.

The Worker submits concurrent FTS5 queries to each relevant shard using Promise.all, collects BM25-ranked results per shard, then merges across shards and re-ranks by BM25 score. Cross-shard BM25 scores are not directly comparable because each shard has its own IDF statistics — a term that appears in 1% of DOJ press releases has a different IDF than the same term appearing in 1% of CFPB complaints. We normalize scores within each shard to [0, 1] before merging:

// TypeScript — cross-dataset FTS5 fan-out in the Cloudflare Worker
// Submits concurrent queries to all narrative dataset shards and merges results

interface FtsResult {
  id: number;
  dataset: string;
  case_date: string;
  title: string;
  source_url: string;
  snippet: string;
  raw_score: number;       // raw BM25 score from FTS5 (negative)
  normalized_score: number; // normalized to [0, 1] within shard for cross-shard merge
}

async function searchNarrativeDatasets(
  fts5Query: string,
  env: Env,
): Promise<FtsResult[]> {
  // Shards and their D1 bindings — each binding is a separate D1 database
  const shardQueries: Array<Promise<FtsResult[]>> = [
    queryEnforcementShard(fts5Query, env.DB_ENFORCEMENT),  // DOJ + SEC
    queryHealthcareShard(fts5Query, env.DB_HEALTHCARE),    // FDA
    queryConsumerShard(fts5Query, env.DB_CONSUMER),        // CFPB
    queryEnvironmentShard(fts5Query, env.DB_ENVIRONMENT),  // EPA ECHO
    querySecuritiesShard(fts5Query, env.DB_SECURITIES),    // SEC litigation releases
  ];

  // Fan-out: all five shards queried concurrently
  const shardResults = await Promise.all(shardQueries);

  // Flatten and normalize scores within each shard before merging
  const allResults: FtsResult[] = [];
  for (const results of shardResults) {
    if (!results.length) continue;

    // BM25 scores are negative; most relevant is most negative
    const minScore = Math.min(...results.map(r => r.raw_score));
    const maxScore = Math.max(...results.map(r => r.raw_score));
    const range = maxScore - minScore || 1; // prevent division by zero

    for (const result of results) {
      // Normalize: most relevant (minScore) → 1.0, least relevant (maxScore) → 0.0
      result.normalized_score = 1 - (result.raw_score - minScore) / range;
      allResults.push(result);
    }
  }

  // Sort by normalized score descending (most relevant first) and return top 50
  return allResults
    .sort((a, b) => b.normalized_score - a.normalized_score)
    .slice(0, 50);
}

// Helper: query the enforcement shard (DOJ press releases + SEC litigation)
async function queryEnforcementShard(
  fts5Query: string,
  db: D1Database,
): Promise<FtsResult[]> {
  const sql = `
    SELECT
      doj_enforcement.id,
      'doj_enforcement' AS dataset,
      doj_enforcement.case_date,
      doj_enforcement.source_url,
      highlight(doj_enforcement_fts, 0, '<mark>', '</mark>') AS title,
      snippet(doj_enforcement_fts, 2, '<mark>', '</mark>', '...', 12) AS snippet,
      bm25(doj_enforcement_fts, 10.0, 5.0, 1.0, 1.0) AS raw_score
    FROM doj_enforcement_fts
    JOIN doj_enforcement ON doj_enforcement.id = doj_enforcement_fts.rowid
    WHERE doj_enforcement_fts MATCH ?
    ORDER BY raw_score
    LIMIT 25
  `;
  const result = await db.prepare(sql).bind(fts5Query).all<FtsResult>();
  return result.results;
}

Performance benchmarks

FTS5 query performance on Cloudflare D1 is fast for typical compliance search workloads. Observed production latency with the D1 Workers binding (no HTTP round-trip):

Query type	Table size	p50	p99
Single-term FTS5 phrase	500K rows (DOJ)	8ms	22ms
4-alias OR-expanded phrase	500K rows (DOJ)	14ms	38ms
Single-shard with highlight() + snippet()	500K rows (DOJ)	11ms	29ms
Cross-dataset fan-out (5 shards, Promise.all)	all narrative datasets	45ms	110ms
Cross-dataset fan-out + alias expansion	all narrative datasets	58ms	140ms

The cross-dataset p50 of 45ms is dominated by the slowest shard in the fan-out, not the total work — concurrent queries mean the five shards are queried in parallel rather than serially. The p99 of 110ms is acceptable for an interactive compliance search and is well within Cloudflare's Worker CPU time budget.

FTS5 index size is a D1 capacity concern. Each FTS5 virtual table adds an inverted index that typically runs 20–30% of the base table size. A DOJ enforcement table with 500K press releases averaging 800 words each stores roughly 400MB of text; its FTS5 index adds another 80–120MB. Cloudflare D1's per-database storage limit means FTS5 indexes count against the same cap as base data. For capacity planning: assume 25% overhead per FTS5 virtual table and factor that into shard sizing when approaching the D1 per-database limit.

Keeping FTS5 indexes current

FTS5 content tables do not automatically update when the base table changes. Every INSERT, UPDATE, or DELETE on the base table must be reflected in the FTS5 virtual table manually. The correct pattern is database triggers, which ensure the FTS5 index is updated atomically with any base table modification:

-- Triggers to keep doj_enforcement_fts synchronized with doj_enforcement
-- Three triggers required: after insert, after delete, after update

-- After INSERT: add the new row to the FTS5 index
CREATE TRIGGER doj_enforcement_ai
AFTER INSERT ON doj_enforcement BEGIN
  INSERT INTO doj_enforcement_fts(
    rowid, title, enforcement_summary, body, district
  ) VALUES (
    new.id, new.title, new.enforcement_summary, new.body, new.district
  );
END;

-- After DELETE: remove the row from the FTS5 index
-- FTS5 content tables use rowid=-1 + the content to signal deletion
CREATE TRIGGER doj_enforcement_ad
AFTER DELETE ON doj_enforcement BEGIN
  INSERT INTO doj_enforcement_fts(
    doj_enforcement_fts, rowid, title, enforcement_summary, body, district
  ) VALUES (
    'delete', old.id, old.title, old.enforcement_summary, old.body, old.district
  );
END;

-- After UPDATE: delete the old index entry, insert the new one
CREATE TRIGGER doj_enforcement_au
AFTER UPDATE ON doj_enforcement BEGIN
  INSERT INTO doj_enforcement_fts(
    doj_enforcement_fts, rowid, title, enforcement_summary, body, district
  ) VALUES (
    'delete', old.id, old.title, old.enforcement_summary, old.body, old.district
  );
  INSERT INTO doj_enforcement_fts(
    rowid, title, enforcement_summary, body, district
  ) VALUES (
    new.id, new.title, new.enforcement_summary, new.body, new.district
  );
END;

FTS5 also provides an optimize command that merges the internal index segments into a single segment, improving query performance and reducing index size after large batches of inserts. FTS5 uses a multi-segment design where each batch of writes creates a new segment; over time, many small segments degrade query performance as FTS5 must merge results across all segments at query time. The optimize command merges all segments into one:

-- Run weekly via Cloudflare Cron to consolidate FTS5 index segments
-- Typically executes in 2–10 seconds on a 500K-row index
INSERT INTO doj_enforcement_fts(doj_enforcement_fts) VALUES ('optimize');
INSERT INTO fda_warning_letters_fts(fda_warning_letters_fts) VALUES ('optimize');

The optimize command is run weekly on all narrative FTS5 tables via a dedicated Cloudflare Cron trigger scheduled at 02:00 UTC Sunday. It runs sequentially across all FTS5 tables rather than concurrently — optimize is a write-heavy operation that briefly locks the FTS5 index and should not be interleaved with active queries. The cron is scheduled for the lowest-traffic window.

For the D1 architecture that hosts these FTS5 indexes: Building the Federal Regulatory Data Hub on Cloudflare D1: 50M+ records at the edge →

For entity ID normalization that complements FTS5 entity name search: Entity ID normalization in the Federal Regulatory Data Hub: resolving CIK, UEI, LEI, DUNS, and NPI →

For the query layer that wraps FTS5 in the REST API: The Federal Regulatory Data Hub query layer: routing 50M+ records at the Cloudflare edge →

For the OFAC SDN integration, which uses FTS5 for entity name screening: OFAC SDN integration: screening entities across the Federal Regulatory Data Hub →