The Food Safety System: Joining CDC Outbreaks with FSIS and FDA Recalls

A cluster of people in three states falls ill with the same strain of Listeria. Weeks later, a deli-meat plant in a fourth state pulls thousands of pounds of product off the shelf. Those two events are the same story—but they live in two different federal databases run by two different agencies, and a third agency wrote down the illnesses that connected them. The US food-safety system is split three ways, and the chain from a detected outbreak to the recall that finally removes the contaminated food only comes together when you join the data the agencies keep apart.

This article covers the three federal food-safety datasets and how to assemble them into one view: the jurisdictional split that puts meat, poultry, and egg products under the USDA and essentially all other food under the FDA, and why a complete recall picture therefore requires both; the CDC's role as the epidemiology that sits across both recall agencies; the CDC National Outbreak Reporting System and the fields it records; the FSIS recall feed and the openFDA food-enforcement endpoint and their respective grains; recall classes and the pathogens—Listeria, Salmonella, and E. coli—that drive most of the action; how the join runs by pathogen, product, firm, and time window; how the Food Safety Modernization Act shifted the FDA toward prevention and what that did to recall timeliness; a Python workflow that pulls recalls by pathogen and lines them up against the outbreaks for the same period while comparing the FSIS and FDA streams; and the caveats of stitching together three agencies that were never designed to be joined.

Three datasets, one system

Food safety in the United States is not administered by a single agency, and the data reflects that. Three federal datasets, maintained by three agencies with three distinct missions, together describe the system. The Centers for Disease Control and Prevention (CDC) is the epidemiology: it counts the people who got sick, identifies the pathogen, and works backward to the food that carried it. The USDA Food Safety and Inspection Service (FSIS) and the Food and Drug Administration (FDA) are the two recall agencies that act on contamination by getting the product off the market. The CDC tells you that something went wrong and what it was; the two recall agencies tell you what was pulled and when. None of the three is complete on its own, and the most useful questions in food-safety policy—how often an outbreak leads to a recall, how long that takes, which foods and pathogens cause the most harm—can only be answered by joining them.

In our database the system is stored as three tables that mirror the three sources: cdc_foodborne_outbreaks for the CDC's outbreak surveillance, usda_fsis_recalls for the FSIS recalls of meat, poultry, and egg products, and fda_food_enforcement for the FDA enforcement reports covering everything else. Each carries product, firm, pathogen, and date fields, which is what makes the join tractable: the work is aligning the pathogens and products and matching the firms across the three, rather than parsing three incompatible sources from scratch. The columns below show the shared backbone—the fields that appear, under different names, across all three and therefore serve as the join keys:

-- cdc_foodborne_outbreaks (one row per reported outbreak)
year, month            -- when the outbreak occurred
state                  -- primary reporting state
etiology               -- pathogen / agent (e.g. Salmonella, Listeria)
food_vehicle           -- implicated food, where identified
setting                -- restaurant, private home, institution, etc.
illnesses              -- confirmed + probable case count
hospitalizations       -- count hospitalized
deaths                 -- count of deaths

-- usda_fsis_recalls (meat, poultry, processed-egg products)
recall_number          -- FSIS recall identifier
recalling_firm         -- establishment issuing the recall
product_description    -- product name and pack size
reason_for_recall      -- often a pathogen (Listeria, Salmonella, E. coli)
recall_class           -- Class I / II / III health-hazard tier
recall_date            -- date the recall was issued

-- fda_food_enforcement (all other food + cosmetics)
recall_number          -- FDA recall identifier
recalling_firm         -- firm initiating the recall
product_description    -- product name and codes
reason_for_recall      -- free-text cause, frequently a pathogen
classification         -- Class I / II / III
report_date, recall_initiation_date  -- key dates

The load-bearing observation is that no single column joins the three cleanly. There is no shared primary key—a CDC outbreak has no recall number, and a recall carries no outbreak identifier. The join is instead built from the soft keys the three sources have in common: the pathogen (the CDC's etiology against the recalls' reason_for_recall), the product or food (the CDC's food_vehicle against the recalls' product_description), the firm (matched across the two recall feeds and, where the CDC names it, against the outbreak), and the time window. Because these are fuzzy text and date matches rather than exact keys, the quality of the assembled picture depends on how carefully the pathogens and products are normalized—a point the caveats section returns to.
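
As a minimal sketch of that shared backbone, the three tables can be projected onto one set of column names before any matching is attempted. The target names (pathogen_text, product_text, firm, event_date), the helper to_backbone, and the choice of recall_initiation_date as the FDA event date are all assumptions made for illustration; the source column names are the ones listed above.

```python
import pandas as pd

# Assumed mapping from each table's columns (as listed above) to shared names.
# The shared names and this helper are illustrative, not an agency schema.
BACKBONE = {
    "cdc_foodborne_outbreaks": {
        "etiology": "pathogen_text",
        "food_vehicle": "product_text",
    },
    "usda_fsis_recalls": {
        "reason_for_recall": "pathogen_text",
        "product_description": "product_text",
        "recalling_firm": "firm",
        "recall_date": "event_date",
    },
    "fda_food_enforcement": {
        "reason_for_recall": "pathogen_text",
        "product_description": "product_text",
        "recalling_firm": "firm",
        "recall_initiation_date": "event_date",
    },
}
SHARED = ["source", "pathogen_text", "product_text", "firm", "event_date"]


def to_backbone(df: pd.DataFrame, table: str) -> pd.DataFrame:
    """Project one source table onto the shared soft-key columns."""
    out = df.rename(columns=BACKBONE[table]).copy()
    out["source"] = table
    if table == "cdc_foodborne_outbreaks":
        # The outbreak table carries year and month only; anchor to day 1.
        out["event_date"] = pd.to_datetime(out[["year", "month"]].assign(day=1))
        out["firm"] = pd.NA  # no firm column in the outbreak schema above
    else:
        out["event_date"] = pd.to_datetime(out["event_date"], errors="coerce")
    return out[SHARED]
```

Stacking the three projected frames with pd.concat gives a single long table in which the remaining work is normalization of the text columns, not schema reconciliation.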

The jurisdictional split: why two recall agencies

The single most important structural fact about US food-safety data—and the one that makes the FSIS table indispensable alongside the FDA table—is the jurisdictional split between the two recall agencies. The USDA Food Safety and Inspection Service regulates meat, poultry, and processed-egg products. The Food and Drug Administration regulates essentially all other food—produce, seafood, dairy, shell eggs, packaged and processed goods, bottled water, dietary supplements—plus cosmetics. The line is statutory, rooted in the Federal Meat Inspection Act, the Poultry Products Inspection Act, and the Egg Products Inspection Act on the USDA side, and the Federal Food, Drug, and Cosmetic Act on the FDA side.

The split produces genuine boundary oddities that anyone working the data should know. A cheese pizza is an FDA product; a pepperoni pizza is, because of the meat topping, an FSIS product. An open-faced meat sandwich falls to the USDA; a closed-faced one falls to the FDA. Shell eggs are FDA; processed egg products (liquid, frozen, or dried) are FSIS. These are not trivia: they determine which database a given recall lands in. The two agencies also operate on different inspection philosophies. FSIS maintains continuous, on-site inspection: federal inspectors are physically present in slaughter and processing establishments every day they operate, a mandate unique in food regulation. The FDA, regulating a vastly larger and more varied universe of facilities, inspects on a risk-based, periodic schedule and cannot be everywhere at once.

The practical consequence for analysis is that the FDA side is far larger than the FSIS side. Meat, poultry, and eggs are a meaningful slice of the diet but a small fraction of the food universe; the FDA-regulated remainder—all the produce, seafood, dairy, and packaged goods—generates many times more recalls. Any comparison of the two streams that does not account for this asymmetry will mistake the size of the jurisdiction for the diligence of the agency. The correct reading is not “the FDA recalls more, so the FDA has a bigger problem,” but that the FDA covers more of what people eat and therefore touches more of the contamination. A complete food-recall picture genuinely requires both datasets; dropping either one silently removes an entire category of food from the analysis.

The CDC sits across both: outbreak surveillance

Where the two recall agencies divide the food universe between them, the CDC sits across both. Its mission is not regulatory—the CDC does not recall food—but epidemiological: to detect that people are getting sick, to identify what is making them sick, and to trace it to a source so the regulatory agencies can act. A foodborne outbreak does not announce which agency's jurisdiction it belongs to; a Salmonella cluster might trace to ground beef (FSIS) or to cantaloupe (FDA), and the CDC investigates both the same way. This is precisely why the CDC's record is the natural spine of a joined analysis: it is the one dataset that does not respect the meat-versus-everything-else boundary, so it can link an illness to whichever side of the recall system ultimately responds.

The CDC's outbreak surveillance runs through the National Outbreak Reporting System (NORS), a web-based platform through which state, local, and territorial health departments report outbreaks of enteric illness, including foodborne ones, to the CDC. The grain is the outbreak—a single event in which two or more people get the same illness from a common source—not the individual case. For each outbreak NORS records the etiology (the pathogen or agent, where it was determined), the implicated food vehicle (the food identified as the source, where investigators could identify one), the setting in which exposure occurred (a restaurant, a private home, a banquet, an institution such as a school or nursing home), the case count, and crucially the hospitalization and death counts. Those last two fields are what let the data weigh outbreaks by severity rather than merely by count: an outbreak with two hospitalizations and a death is a different event from one with two cases of mild gastroenteritis, even though both are one row.
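
Weighing outbreaks by severity rather than by row count can be sketched as below. The column names follow the cdc_foodborne_outbreaks listing earlier in this article; the function name and the per-100-illness rates are illustrative choices, not a CDC method.

```python
import pandas as pd


def severity_by_etiology(outbreaks: pd.DataFrame) -> pd.DataFrame:
    """Total illnesses, hospitalizations, and deaths per pathogen, with rates."""
    grouped = outbreaks.groupby("etiology")
    totals = grouped[["illnesses", "hospitalizations", "deaths"]].sum()
    totals["outbreaks"] = grouped.size()
    # Avoid dividing by zero when a pathogen has no recorded illnesses.
    ill = totals["illnesses"].where(totals["illnesses"] > 0)
    totals["hosp_per_100_ill"] = 100 * totals["hospitalizations"] / ill
    totals["deaths_per_100_ill"] = 100 * totals["deaths"] / ill
    return totals.sort_values("deaths", ascending=False)
```

Sorting by deaths instead of by outbreak count is the point of the exercise: a pathogen with few outbreaks can still sit at the top of the table.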

Two features of NORS shape every analysis built on it. First, the food vehicle is frequently unidentified. Tracing an outbreak to a specific food is hard—people misremember what they ate, leftovers are gone, and the contaminated lot may already be consumed—so a substantial share of outbreaks are recorded with an unknown or only broadly categorized vehicle. An outbreak with no named food cannot be matched to a specific recall, which structurally limits how many outbreak-to-recall links the data can ever support. Second, outbreaks are a small, biased sample of all foodborne illness. Most foodborne illness is sporadic—a single person, never connected to anyone else—and never becomes an outbreak at all. NORS captures the clustered, investigated tip of a very large iceberg, which is the right denominator for some questions and badly wrong for others.
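
Before attempting any outbreak-to-recall match, it helps to measure how many outbreaks are matchable at all. The sketch below assumes the food_vehicle column from the schema above and a hypothetical set of placeholder strings for an unidentified vehicle; the real export may mark unknowns differently, so the marker set should be checked against the data.

```python
import pandas as pd

# Assumed placeholders for "no vehicle identified"; verify against the export.
UNKNOWN_MARKERS = {"", "unknown", "unspecified", "undetermined"}


def has_named_vehicle(food_vehicle: pd.Series) -> pd.Series:
    """True where an outbreak names a specific implicated food."""
    cleaned = food_vehicle.fillna("").astype(str).str.strip().str.lower()
    return ~cleaned.isin(UNKNOWN_MARKERS)


def matchable_share(outbreaks: pd.DataFrame) -> float:
    """Share of outbreaks that could, in principle, be tied to a recall."""
    return float(has_named_vehicle(outbreaks["food_vehicle"]).mean())
```

The resulting share is an upper bound on the link rate: no amount of normalization downstream can recover a match for an outbreak whose vehicle was never recorded.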

The recall feeds: FSIS and openFDA

On the recall side, the two agencies publish their data through separate public, key-free channels. FSIS exposes its recalls through a recall API on fsis.usda.gov that returns the open and historical recall set as structured JSON. Each FSIS recall record names the recalling establishment, describes the product and its pack sizes, states the reason for the recall, assigns a recall class, and carries the recall date, along with the affected states and, increasingly, product label images and lot detail. Because FSIS maintains continuous in-plant inspection, many of its recalls originate from the agency's own findings—a positive pathogen test on a product sample, or a process-control failure—in addition to firm-initiated recalls.

The FDA publishes its recalls through the openFDA food-enforcement endpoint at api.fda.gov/food/enforcement.json, drawn from the FDA Enforcement Report. Each row is a recall event and carries the recalling firm, a product description with codes, the free-text reason_for_recall, a classification (the FDA's recall class), the distribution pattern, the status of the recall, and several dates—the recall_initiation_date (when the firm began the recall) and the report_date (when the FDA published it) being the two most analytically important, because the gap between them is a measure of timeliness. openFDA permits a useful volume of requests without a key and a much higher ceiling with a free registered key, and its search and count parameters allow server-side filtering and tallies that make pathogen-by-pathogen queries efficient.
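
The timeliness gap described here reduces to a date subtraction once the two fields are parsed. The sketch below assumes the dates arrive as YYYYMMDD strings, which is how openFDA returns them; the function name is illustrative.

```python
import pandas as pd


def publication_gap_days(recalls: pd.DataFrame) -> pd.Series:
    """Days from the firm starting a recall to the FDA publishing it."""
    started = pd.to_datetime(
        recalls["recall_initiation_date"], format="%Y%m%d", errors="coerce")
    published = pd.to_datetime(
        recalls["report_date"], format="%Y%m%d", errors="coerce")
    # Rows with a missing or malformed date come back as NaN, not an error.
    return (published - started).dt.days
```

A negative or very large gap is worth inspecting by hand before it enters a summary statistic, since it more often signals a data-entry problem than a real delay.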

Both feeds share the recall classification scheme, which is central to reading the data by severity. A Class I recall denotes a reasonable probability that the product will cause serious adverse health consequences or death; it is the tier that Listeria contamination, undeclared major allergens, and certain E. coli and Salmonella findings almost always fall into. Class II denotes a remote probability of serious harm or a probability of temporary, reversible harm. Class III denotes a product unlikely to cause adverse health consequences, typically a labeling or quality defect. Filtering recalls to Class I is the standard way to isolate the genuinely dangerous events from the long tail of minor ones, and it is the class most likely to correspond to a CDC outbreak.
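
A hedged sketch of the Class I filter follows. It assumes the class is stored as the strings "Class I", "Class II", and "Class III" in the classification column (FDA) or recall_class column (FSIS), as in the schema above. The URL helper builds a server-side tally using openFDA's count parameter; the use of the .exact suffix on classification is an assumption to confirm against the current openFDA field reference.

```python
import pandas as pd

FDA = "https://api.fda.gov/food/enforcement.json"


def class_tally_url(pathogen: str) -> str:
    """URL asking openFDA to count matching recalls per class, server-side."""
    phrase = pathogen.replace(" ", "+")
    # Assumption: classification supports the .exact suffix for counting.
    return (f'{FDA}?search=reason_for_recall:"{phrase}"'
            f"&count=classification.exact")


def class_one(recalls: pd.DataFrame,
              column: str = "classification") -> pd.DataFrame:
    """Keep only Class I rows, tolerating case and stray whitespace."""
    labels = recalls[column].astype(str).str.strip().str.lower()
    return recalls[labels == "class i"]
```

For the FSIS table the same filter applies with column="recall_class".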

The pathogens that drive the system

Across all three datasets, a small set of pathogens accounts for a disproportionate share of both illness and recalls, and they are the natural axis along which the join runs. Three dominate the serious end.

Listeria monocytogenes causes relatively few illnesses but a strikingly high share of deaths, because listeriosis is severe in pregnant women, newborns, the elderly, and the immunocompromised, and because Listeria grows at refrigeration temperatures, which makes ready-to-eat foods like deli meats, soft cheeses, and packaged produce its characteristic vehicles. Listeria recalls skew heavily toward Class I and frequently arise from environmental and product testing at the plant rather than from a detected outbreak, which means a Listeria recall sometimes precedes any reported illness. Salmonella is the workhorse pathogen, one of the largest causes of foodborne illness, hospitalization, and death by total burden, spread across an enormous range of vehicles from poultry and eggs (the FSIS and FDA boundary literally runs through the egg) to produce, peanut products, spices, and pet food. Shiga toxin-producing E. coli, above all the O157:H7 serotype, is the classic ground-beef and leafy-greens pathogen and the cause of the most notorious outbreaks; its capacity to cause hemolytic uremic syndrome in children makes its recalls reliably Class I.

For the join, these pathogens are the connective tissue. The CDC's etiology field and the recalls' reason_for_recall text both name them, so a recall attributed to Listeria can be lined up against the Listeria outbreaks of the same period. But the matching is messier than it looks: the CDC may record a precise serotype (Salmonella Enteritidis, E. coli O157:H7) while a recall reason says only “Salmonella” or “possible E. coli contamination,” and the recall reason is free text that spells the same organism several ways: Escherichia coli, E. coli, E.coli, STEC. Normalizing the pathogen names to a common vocabulary is the first and most consequential step of any cross-dataset food-safety analysis, and the single biggest determinant of how many real links the join recovers.
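
One way to sketch that normalization is a small ordered list of regular expressions that maps each spelling to a single label. The labels and patterns below are illustrative and cover only the three pathogens discussed here; collapsing every E. coli mention to one label deliberately discards serotype detail, which a finer-grained analysis would want to keep in a second column.

```python
import re

# Illustrative vocabulary: (common label, pattern matching its spellings).
PATTERNS = [
    ("E. coli", re.compile(
        r"\b(escherichia\s+coli|e\.?\s*coli|stec|o157)\b", re.IGNORECASE)),
    ("Listeria", re.compile(r"\blisteria\b", re.IGNORECASE)),
    ("Salmonella", re.compile(r"\bsalmonella\b", re.IGNORECASE)),
]


def normalize_pathogen(text):
    """Map free-text etiology or recall reason to a common label, or None."""
    if not isinstance(text, str):
        return None
    for label, pattern in PATTERNS:
        if pattern.search(text):
            return label
    return None


for raw in ["Escherichia coli O157:H7", "possible E.coli contamination",
            "Salmonella Enteritidis", "undeclared milk"]:
    print(f"{raw!r:40} -> {normalize_pathogen(raw)}")
```

Applying the same function to the CDC etiology column and to both recall reason columns gives the three tables one pathogen key to join on.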

The chain: from outbreak to recall

Assembled, the three datasets reconstruct the central chain of the food-safety system: detection, attribution, and removal. In the canonical sequence, people fall ill; clinicians and labs report the cases; whole-genome sequencing links cases that share a genetic fingerprint into an outbreak; the CDC and its partners trace the outbreak to a food and a producer; and the responsible agency—FSIS for meat and poultry, the FDA for everything else—issues the recall that pulls the product off shelves. The CDC record captures the front of that chain, the recall feeds capture the end, and the join across pathogen, food, firm, and time reconstructs the whole.

Two questions fall out of putting the chain together, and both are central to food-safety policy. The first is how often an outbreak is actually tied to a recall, and how long that takes. Ordering outbreaks and recalls in time for the same pathogen and food window lets an analyst estimate the lag from the start of an outbreak to the recall that ends it, a direct measure of how fast the system responds, and count the outbreaks that never produced a recall at all, whether because the food was never identified, the source was a restaurant rather than a recallable product, or the contaminated lot was gone before it could be pulled. The second question runs the other direction: not every recall comes from an outbreak. A large and growing share of recalls, especially the Listeria and allergen recalls, are preventive, triggered by a positive environmental or product test before anyone is known to have gotten sick. Distinguishing outbreak-driven recalls from preventive recalls, which the join makes possible, is itself a measure of how far the system has moved from reacting to illness toward preventing it.
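
Both questions can be sketched with one merge and one date filter. The frames below are assumed to have already been normalized to a shared pathogen column; the names outbreak_id, onset_date, and the 180-day window are hypothetical choices for illustration. Matching on pathogen and time alone yields candidate links only; a defensible link would also compare the food and the firm.

```python
import pandas as pd


def candidate_links(outbreaks: pd.DataFrame, recalls: pd.DataFrame,
                    max_lag_days: int = 180) -> pd.DataFrame:
    """Pair each outbreak with same-pathogen recalls issued within the window."""
    pairs = outbreaks.merge(recalls, on="pathogen", how="inner")
    lag = pd.to_datetime(pairs["recall_date"]) - pd.to_datetime(pairs["onset_date"])
    pairs["lag_days"] = lag.dt.days
    # Keep recalls issued on or after onset, up to max_lag_days later.
    in_window = pairs["lag_days"].between(0, max_lag_days)
    return pairs[in_window].reset_index(drop=True)


def flag_outbreak_linked(recalls: pd.DataFrame,
                         links: pd.DataFrame) -> pd.DataFrame:
    """Mark recalls that have at least one candidate outbreak."""
    out = recalls.copy()
    out["outbreak_linked"] = out["recall_number"].isin(links["recall_number"])
    return out
```

A recall left unflagged is a candidate preventive recall, not a proven one: the outbreak may simply be missing from the data or recorded without a usable pathogen.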

FSMA and the shift toward prevention

The most important regulatory development behind the FDA side of the data is the Food Safety Modernization Act (FSMA), signed in 2011 and implemented through a suite of rules over the following years. FSMA was the most sweeping reform of US food-safety law in decades, and its animating idea was a shift in posture from responding to contamination to preventing it. Before FSMA the FDA largely reacted—it acted after illness or contamination surfaced. FSMA required food facilities to build and follow written preventive-controls plans, established produce-safety standards for the farm, created a foreign-supplier verification regime for imported food, and—pivotally for the recall data—gave the FDA mandatory recall authority for the first time, where it had previously depended almost entirely on firms recalling voluntarily.

For an analyst, FSMA is the reason the recall data is not a stationary series. Because the law pushed firms toward environmental monitoring and preventive controls, more contamination is now caught by testing before it causes illness, which should show up as a rising share of preventive, test-driven recalls and, ideally, as a shrinking lag between contamination and removal. Comparing recall timeliness—the gap between recall initiation and FDA publication, and the lag from a linked outbreak's onset to the recall—across the pre- and post-FSMA eras is one of the more policy-relevant analyses the joined data supports. It must be done carefully, because reporting practices and the openFDA record itself changed over the same period, but the question—did the prevention paradigm make recalls faster and more anticipatory?—is exactly the kind the three datasets together can speak to and that none can answer alone.
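
A first-pass version of the era comparison might look like the sketch below. It splits recalls at the date FSMA was signed (January 4, 2011), which is a simplifying assumption: the law's rules took effect in stages over later years, so a serious analysis would test several cutoffs. Dates are assumed to be YYYYMMDD strings as returned by openFDA.

```python
import pandas as pd

# Assumption: the signing date is used as a single cutoff for illustration.
FSMA_SIGNED = pd.Timestamp("2011-01-04")


def gap_by_era(recalls: pd.DataFrame) -> pd.DataFrame:
    """Count and median initiation-to-publication gap, before and after FSMA."""
    started = pd.to_datetime(
        recalls["recall_initiation_date"], format="%Y%m%d", errors="coerce")
    published = pd.to_datetime(
        recalls["report_date"], format="%Y%m%d", errors="coerce")
    era = started.ge(FSMA_SIGNED).map({True: "post-FSMA", False: "pre-FSMA"})
    frame = pd.DataFrame({"era": era, "gap_days": (published - started).dt.days})
    frame = frame[started.notna()]  # drop rows with no usable start date
    return frame.groupby("era")["gap_days"].agg(["count", "median"])
```

The median is used in place of the mean because a handful of very late publications can dominate an average.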

Python workflow: joining recalls to outbreaks by pathogen

The script below pulls recalls by pathogen from both the openFDA food-enforcement endpoint and the FSIS recall API, compares the two streams pathogen by pathogen, and breaks the FDA recalls down by recall class. It is the recall half of the chain; lining the recalls up against the CDC NORS outbreaks for the same pathogen and period—the third source—is the natural next step, and the structure here makes the temporal join straightforward to add. No API key is required for public data, though a free openFDA key raises the rate ceiling. Because the FSIS recall reason lives in a field whose name varies between releases, the script scans every text column for the pathogen rather than hard-coding a column, and any production use should be validated against the current FSIS and openFDA schemas and should page through the full result set.

from collections import Counter

import pandas as pd
import requests

# Three public, key-free federal food-safety sources, joined by pathogen
# and time window:
#   1. FDA food enforcement (recalls of all non-meat food)  -- openFDA
#   2. USDA FSIS recalls (meat, poultry, processed egg)      -- FSIS recall API
#   3. CDC NORS foodborne outbreaks                          -- annual data export
# openFDA allows ~240 req/min and 1,000/day without a key; a free key at
# https://open.fda.gov/apis/authentication/ raises the ceiling.

FDA = "https://api.fda.gov/food/enforcement.json"
FSIS = "https://www.fsis.usda.gov/fsis/api/recall/v/1"

PATHOGENS = ["Listeria", "Salmonella", "Escherichia coli", "E. coli"]


def fda_recalls(pathogen, since="2015-01-01", limit=1000):
    # reason_for_recall is free text; openFDA search matches it loosely.
    q = (f'reason_for_recall:"{pathogen}"+AND+'
         f'report_date:[{since.replace("-", "")}+TO+99991231]')
    url = f"{FDA}?search={q}&limit={limit}"
    r = requests.get(url, timeout=120)
    if r.status_code == 404:        # openFDA returns 404 for zero hits
        return pd.DataFrame()
    r.raise_for_status()
    rows = r.json().get("results", [])
    df = pd.DataFrame(rows)
    df["agency"] = "FDA"
    return df


def fsis_recalls():
    # FSIS publishes the full open recall set as one JSON document.
    r = requests.get(FSIS, timeout=120)
    r.raise_for_status()
    df = pd.DataFrame(r.json())
    df["agency"] = "FSIS"
    return df


def fsis_by_pathogen(df, pathogen):
    # The reason text lives in a field whose name varies by release; scan
    # every string column for the pathogen rather than hard-coding one.
    text = df.select_dtypes(include="object").apply(
        lambda col: col.str.contains(pathogen, case=False, na=False))
    return df[text.any(axis=1)]


def compare_streams():
    summary = {}
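    fsis_all = fsis_recalls()   # fetch the FSIS set once, not once per pathogen
    for p in ["Listeria", "Salmonella", "Escherichia coli"]:
        fda = fda_recalls(p)
        # Note: the FSIS side matches any "coli" spelling, while the FDA
        # search matches only the full name, so FDA E. coli counts run low.
        fsis = fsis_by_pathogen(fsis_all, "coli" if "coli" in p else p)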
        summary[p] = {"fda_recalls": len(fda), "fsis_recalls": len(fsis)}
    out = pd.DataFrame(summary).T
    out["fda_share"] = out["fda_recalls"] / (
        out["fda_recalls"] + out["fsis_recalls"]).clip(lower=1)
    print("Recalls by pathogen and agency:")
    for p, row in out.iterrows():
        print(f"  {p:<18} FDA {int(row.fda_recalls):>4}  "
              f"FSIS {int(row.fsis_recalls):>4}  "
              f"(FDA share {row.fda_share:.0%})")
    return out


def recall_class_mix(pathogen="Listeria"):
    # FDA classifies recalls I/II/III by health hazard; Class I is the most
    # serious. Listeria recalls skew heavily Class I.
    df = fda_recalls(pathogen)
    if df.empty:
        return Counter()
    mix = Counter(df.get("classification", pd.Series(dtype=str)).dropna())
    print(f"\n{pathogen} FDA recall classes: {dict(mix)}")
    return mix


compare_streams()
recall_class_mix("Listeria")

Two practical notes. First, the stream comparison in the script is deliberately coarse: it counts FDA and FSIS recalls per pathogen to expose the jurisdictional asymmetry, but a rigorous comparison must normalize for the size of each agency's jurisdiction and restrict to a common date window and recall class, because raw counts conflate “more food” with “more risk.” Second, the outbreak join is left as the next step on purpose: it requires loading the CDC NORS export, normalizing its etiology to the same pathogen vocabulary the script applies to the recall reasons, and matching on pathogen plus a time window—tolerating the lag between an outbreak's onset and the recall that follows it. For national-scale work, the CDC's downloadable NORS data and the FSIS and FDA bulk or paginated feeds together are far more efficient than ad-hoc queries and carry the authoritative, version-stamped field definitions for each release.
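
Paging through a full openFDA result set can be sketched as below. The generator takes a fetch function so it can be exercised without network access. The limit of 1,000 rows per request and the skip ceiling of 25,000 reflect openFDA's documented paging limits as best understood here and should be checked against the current documentation; the helper names are hypothetical.

```python
from typing import Callable, Iterator

FDA = "https://api.fda.gov/food/enforcement.json"


def page_results(fetch_page: Callable[[int, int], dict],
                 page_size: int = 1000,
                 max_skip: int = 25000) -> Iterator[dict]:
    """Yield every result row, advancing skip until the total is reached."""
    skip = 0
    while skip <= max_skip:
        payload = fetch_page(skip, page_size)
        rows = payload.get("results", [])
        yield from rows
        total = payload.get("meta", {}).get("results", {}).get("total", 0)
        skip += page_size
        if not rows or skip >= total:
            break


def openfda_fetcher(search: str) -> Callable[[int, int], dict]:
    """Build a fetch function for one openFDA search expression."""
    def fetch(skip: int, limit: int) -> dict:
        import requests  # imported lazily so the pager stays testable offline
        url = f"{FDA}?search={search}&limit={limit}&skip={skip}"
        r = requests.get(url, timeout=120)
        if r.status_code == 404:  # openFDA returns 404 for zero hits
            return {"results": []}
        r.raise_for_status()
        return r.json()
    return fetch
```

A call such as list(page_results(openfda_fetcher('reason_for_recall:"Listeria"'))) would then collect all matching rows up to the skip ceiling; result sets larger than that need to be split by date range.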

Limitations and analytical caveats

Joining three federal datasets that were never designed to be joined is powerful, but the seams are real, and an analyst must hold several caveats firmly in mind before drawing conclusions.

There is no shared key, so the join is fuzzy. The link between an outbreak and a recall is reconstructed from pathogen names, product and food descriptions, firm names, and dates—all of which are recorded differently across the three agencies. Pathogen names are spelled inconsistently and recorded at different levels of specificity; product descriptions are free text; firm names vary in punctuation, suffixes, and subsidiaries. Every match is therefore probabilistic, and the rate of true links recovered depends entirely on the quality of the normalization. A naive exact-match join will find almost nothing; an over-eager fuzzy join will manufacture connections that are not there. The honest posture is to treat the assembled chain as a well-supported hypothesis rather than a ledger.
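
For firm names, a conservative middle path between exact and over-eager matching is to normalize first and then require a high similarity score. The sketch below uses the standard library's difflib; the suffix list and the 0.9 threshold are assumptions to tune against hand-checked pairs, not established values.

```python
import re
from difflib import SequenceMatcher

# Illustrative list of corporate suffixes to drop before comparing.
SUFFIXES = re.compile(
    r"\b(inc|llc|ltd|co|corp|corporation|company|incorporated)\b")


def normalize_firm(name: str) -> str:
    """Lowercase, strip punctuation and corporate suffixes, collapse spaces."""
    s = re.sub(r"[^a-z0-9\s]", " ", name.lower())
    s = SUFFIXES.sub(" ", s)
    return " ".join(s.split())


def firm_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between two normalized firm names."""
    return SequenceMatcher(None, normalize_firm(a), normalize_firm(b)).ratio()


def same_firm(a: str, b: str, threshold: float = 0.9) -> bool:
    """Treat two names as one firm only above a strict threshold."""
    return firm_similarity(a, b) >= threshold
```

Pairs that score just below the threshold are best routed to manual review, since subsidiaries and renamed firms will not be resolved by string similarity alone.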

Outbreaks are a biased, lagging sample. NORS captures only the clustered, investigated outbreaks, not the much larger volume of sporadic foodborne illness, and many outbreaks never have an identified food vehicle, which caps how many can ever be matched to a specific recall. Outbreak data is also reported and finalized with substantial lag—an investigation takes time, and NORS is published in annual batches—so the most recent periods are systematically under-represented. The CDC record is authoritative for established patterns and multi-year trends; it is not a real-time monitor of what made people sick last month.

The two recall streams are not comparable at face value. The jurisdictional split means the FDA stream covers a far larger share of the food universe than the FSIS stream, so raw counts say more about the size of each jurisdiction than about agency performance or relative risk. The agencies also classify, code, and publish recalls under different conventions and on different cadences. Any FSIS-versus-FDA comparison must control for jurisdiction size, recall class, and date window, and should resist the temptation to read a larger count as a worse record.

A recall is an action, not an outcome, and a coded field is a summary. A recall measures that a product was pulled, not how much reached consumers, how much was recovered, or whether anyone was harmed; the recall class is a hazard judgment, not a casualty count. The free-text reason and product fields compress the real detail of an event into terse summaries, and the full texture—the lot codes, the distribution, the corrective action—lives in the underlying agency documents, not the structured feed. Treating a recall count as a measure of harm, or a clean reason field as a complete account of the cause, over-reads what the data can bear.

Held with these caveats in mind, the three tables together (cdc_foodborne_outbreaks, usda_fsis_recalls, and fda_food_enforcement) are uniquely valuable: the only way to see the US food-safety system whole, tracing the chain from the illnesses that reveal a contaminated food to the recalls that finally pull it off the shelf, across a jurisdictional divide that the data, but never the contamination, respects.

Related writing

CDC Foodborne Outbreak Data: The Federal Database Behind Every US Food Poisoning Investigation — The epidemiological spine of this join, the National Outbreak Reporting System records the pathogen, implicated food, case, hospitalization, and death counts that connect illnesses to the recalls that follow.

USDA FSIS Food Safety Data: The Federal Recall Database and Inspection Records Behind Meat, Poultry, and Egg Safety — The meat-and-poultry half of the recall picture, backed by FSIS's unique continuous in-plant inspection, supplies one of the two recall streams the chain depends on.

FDA Food Enforcement Reports: The Federal Database Behind Food and Cosmetic Recalls — The far larger FDA side covers all non-meat food and cosmetics, and the openFDA enforcement endpoint and recall classes are the backbone of the recall queries in this article.