CDC Foodborne Outbreak Data: The Federal Database Behind Every US Food Poisoning Investigation

January 25, 2027 · 16 min read · AI Analytics

CDC · Foodborne Illness · Outbreaks · Food Safety · Federal Data

The CDC Foodborne Disease Outbreak Surveillance System (FDOSS) records every foodborne illness outbreak reported to CDC since 1998 — etiology, implicated food, setting, illnesses, hospitalizations, and deaths — providing the primary evidence base for food recall decisions and food safety policy.

When investigators trace a cluster of illnesses back to a bag of romaine lettuce or a jar of peanut butter, the resulting case data eventually flows into FDOSS. The system is not a complete picture of foodborne illness in the United States — it is an iceberg tip, capturing only outbreak-associated cases that are investigated and reported by state and local health departments. But it is the most granular federal record available of where outbreaks occur, what pathogens cause them, what foods are implicated, and how severe they are, making it the foundation on which food recall policy, produce safety regulations, and pathogen-specific intervention strategies are built.

What FDOSS is

The Foodborne Disease Outbreak Surveillance System is operated by the CDC Division of Foodborne, Waterborne, and Environmental Diseases. State and local health departments investigate foodborne illness outbreaks and submit outbreak reports to CDC; FDOSS aggregates these reports into the national database. An outbreak is defined as two or more persons experiencing a similar illness after consuming food or beverages from a common source. Single sporadic cases are not counted. This definition immediately limits what the system captures: a person sickened by contaminated home-cooked chicken who never seeks care, or whose physician never orders a stool culture, or whose culture is never subtyped and matched to a cluster, contributes nothing to FDOSS.

Approximately 800 to 1,000 outbreaks are reported to FDOSS each year, generating roughly 15,000 illnesses, 800 hospitalizations, and 20 deaths among reported outbreak cases. Annual results are published in the Morbidity and Mortality Weekly Report (MMWR) as “Surveillance for Foodborne Disease Outbreaks, United States” — the primary federal summary of outbreak trends. The numbers sound modest, but they represent only the visible portion of a far larger burden. CDC estimates that 48 million Americans experience foodborne illness each year from all causes, resulting in 128,000 hospitalizations and 3,000 deaths annually — meaning FDOSS captures roughly one in 3,000 actual foodborne illness cases. The ratio reflects underreporting at every step: most ill people never see a physician, most physicians never order stool cultures, most positive cultures are never linked to an outbreak, and most outbreaks are never formally investigated.
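The one-in-3,000 figure follows from the two rounded numbers quoted above. A quick check, using only those rounded estimates rather than any live dataset:

```python
# Rounded figures quoted in the text; not pulled from a live dataset.
ESTIMATED_ANNUAL_ILLNESSES = 48_000_000   # CDC all-cause estimate
REPORTED_OUTBREAK_ILLNESSES = 15_000      # approximate FDOSS annual total


def capture_ratio(estimated: int, reported: int) -> float:
    """Return how many estimated illnesses occur per illness reported."""
    if reported <= 0:
        raise ValueError("reported must be positive")
    return estimated / reported


ratio = capture_ratio(ESTIMATED_ANNUAL_ILLNESSES, REPORTED_OUTBREAK_ILLNESSES)
print(f"About 1 reported outbreak illness per {ratio:,.0f} estimated illnesses")
```

The exact quotient is 3,200, which the text rounds to "one in 3,000". Both inputs are themselves estimates, so the ratio is best read as an order of magnitude.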

FDOSS records are submitted through the National Outbreak Reporting System (NORS), a web-based platform that replaced paper reporting forms beginning in 2009. NORS collects structured data on each outbreak: pathogen (confirmed or suspected), food vehicle, food preparation setting, contamination factor (improper holding temperature, infected food handler, inadequate cooking, etc.), number of ill persons, hospitalizations, deaths, and laboratory confirmation status. Cases with confirmed etiology — where laboratory testing identified the causative pathogen — are distinguished from those with suspected etiology based on incubation period, symptoms, and epidemiological pattern.

Leading outbreak pathogens

Norovirus

Norovirus is the most common cause of foodborne illness outbreaks in the United States, accounting for approximately half of all reported outbreaks in FDOSS. Its dominance reflects both biological and epidemiological characteristics: Norovirus has an extremely low infectious dose (approximately 18 viral particles are sufficient to cause infection), survives on surfaces and in shellfish for extended periods, and spreads efficiently in settings where contaminated food or surfaces contact multiple individuals. Restaurant buffets, catered events, and cruise ship dining facilities are classic high-transmission environments. The most common contamination pathway is the infected food handler — a person who is ill or recently recovered and who handles ready-to-eat food with bare hands.

Norovirus is not uniformly distributed across food categories. Leafy greens and fresh produce contaminated during production or processing, shellfish (particularly bivalves that filter-feed in sewage-contaminated coastal waters), and foods handled extensively during preparation are common vehicles. Because Norovirus is not a regulated pathogen in the same way bacterial pathogens are — there are no routine Norovirus tests in food production environments — prevention depends heavily on hand hygiene and exclusion of ill food workers from food handling. The illness is self-limiting in healthy individuals (12–60 hours of vomiting and diarrhea), which means it generates low hospitalization rates in FDOSS despite producing the highest outbreak counts.

Salmonella

Salmonella is the second most common cause of foodborne outbreaks and the leading cause of foodborne illness hospitalizations and deaths in FDOSS data. CDC estimates 1.35 million Salmonella infections annually, with roughly 26,500 hospitalizations and 420 deaths — a case fatality rate substantially higher than Norovirus. The pathogen colonizes the intestinal tracts of a wide range of animals and contaminates food through fecal contact during slaughter, handling, or irrigation. Eggs, poultry, raw produce, nut butters, and raw flour have all been implicated in major multistate outbreaks.

Within Salmonella, serotyping identifies the specific strain involved. The four most common serotypes in US outbreak investigations are Newport, Enteritidis, Typhimurium, and Infantis, though the distribution shifts year to year based on which serotypes are circulating in animal reservoirs. Serotype data matters for traceback: when a cluster of Salmonella Newport cases appears in PulseNet, investigators know to look for a shared food source consistent with Newport's known animal reservoir associations. Whole genome sequencing has progressively replaced pulsed-field gel electrophoresis (PFGE) as the primary subtyping method; WGS provides far higher discriminatory power, distinguishing outbreak strains from background strains that happen to share the same serotype.

Clostridium perfringens

Clostridium perfringens is the third most common outbreak pathogen in FDOSS and is almost exclusively associated with improper temperature control in large-batch food service settings. The organism forms heat-resistant spores that survive cooking; if cooked beef or poultry is then held at warm temperatures (between roughly 12°C and 54°C) rather than rapidly cooled, surviving spores germinate and the bacteria multiply to levels capable of causing illness. Catered events, institutional cafeterias, and holiday meals with large roasts held at inadequate warming temperatures are the classic outbreak scenarios. The illness — primarily diarrhea and abdominal cramps beginning 6–24 hours after eating — is self-limiting and rarely results in hospitalization.

Campylobacter

Campylobacter is the most common cause of bacterial gastroenteritis globally and a significant FDOSS pathogen, though it appears less prominently in outbreak data than in sporadic case surveillance because Campylobacter outbreaks are often difficult to recognize — cases typically occur individually rather than in large clusters at shared events. The primary vehicles in outbreak settings are undercooked poultry and raw or inadequately pasteurized milk. Campylobacter is unusual among foodborne pathogens in that it does not multiply in food: it must reach the food in sufficient quantity from the original animal source to cause infection, which limits its outbreak potential compared to pathogens that amplify in improperly held food. Guillain-Barré syndrome — a post-infectious neurological condition — is a rare but severe sequela of Campylobacter infection.

E. coli O157:H7

Escherichia coli O157:H7 produces Shiga toxin and is one of the most closely watched pathogens in food safety surveillance due to the severity of illness it can cause. Shiga toxin-producing E. coli (STEC) can cause hemolytic uremic syndrome (HUS) — a complication involving destruction of red blood cells, platelet reduction, and acute kidney failure — particularly in children under five and the elderly. HUS develops in approximately 5–10 percent of STEC O157:H7 cases and can be fatal or cause permanent kidney damage. The infectious dose is extraordinarily low (approximately 10 organisms), meaning even a small amount of contaminated food can cause illness.

The primary reservoirs are cattle, and ground beef was the historically dominant vehicle. The 1993 Jack in the Box outbreak — 73 HUS cases, four deaths — catalyzed a fundamental shift in USDA food safety regulation, leading to the declaration of E. coli O157:H7 as an adulterant in ground beef and mandatory testing. More recently, fresh leafy greens — particularly romaine lettuce and spinach — have become a major vehicle, contaminated through irrigation water drawn from sources with cattle fecal runoff. Whole genome sequencing has revolutionized outbreak attribution: the 2018–2019 romaine lettuce outbreaks demonstrated WGS's power to link cases across multiple states to a specific growing region even before a contaminated product lot was confirmed.

Listeria monocytogenes

Listeria monocytogenes causes the highest case fatality rate of any foodborne pathogen routinely tracked in FDOSS — approximately 20 percent of diagnosed cases are fatal. The pathogen disproportionately affects pregnant women (in whom it can cause miscarriage, stillbirth, or severe neonatal illness), adults over 65, and immunocompromised individuals; healthy adults rarely develop serious illness. Listeria is unusual among foodborne pathogens in that it grows at refrigerator temperatures (2–4°C), meaning refrigeration does not prevent multiplication in contaminated food. This characteristic allows Listeria to persist and amplify in ready-to-eat foods stored under standard refrigeration.

Deli meats, soft cheeses made from unpasteurized milk, smoked seafood, and cantaloupes are among the most frequently implicated vehicles. What makes Listeria particularly challenging from a food safety standpoint is its ability to establish persistent environmental contamination in food production facilities: certain strains colonize drains, floor cracks, refrigeration equipment, and food contact surfaces, persisting for years or even decades despite cleaning and sanitation efforts. When a production environment harbors a persistent Listeria strain, food products manufactured in that facility can be contaminated repeatedly over extended periods, making outbreak investigations more difficult because the exposure window may span months.

Hepatitis A

Hepatitis A virus (HAV) causes outbreaks primarily through shellfish harvested from sewage-contaminated coastal waters — bivalves (clams, oysters, mussels) filter feed and bioaccumulate virus from the surrounding water — and through food handlers who are infected and have not been vaccinated. HAV infection has a long incubation period (15–50 days), which complicates traceback because the implicated food may no longer be available by the time cases are recognized as an outbreak. The hepatitis A vaccine is safe, effective, and available, making food handler vaccination a public health intervention with clear outbreak prevention potential. Several states mandate hepatitis A vaccination for food handlers as a result. Frozen berry outbreaks linked to international supply chains have been among the most prominent HAV foodborne events in recent years.

Major outbreak investigations

Several high-profile outbreaks illustrate how FDOSS data functions within the broader outbreak investigation infrastructure.

The 2011 Jensen Farms cantaloupe Listeria outbreak is widely described as the deadliest US foodborne outbreak since the oyster-associated typhoid fever epidemic of 1924-25. A total of 147 persons across 28 states were infected; 33 died. The contamination was traced to a Colorado cantaloupe packinghouse where equipment design allowed water pooling and Listeria persistence. The outbreak demonstrated the vulnerability of whole fresh produce to packinghouse environmental contamination (melons in particular, whose rough netted surfaces harbor bacteria) and led to enhanced FDA regulatory focus on produce packing facilities.

The 2015 Blue Bell Creameries Listeria outbreak produced 10 confirmed illnesses and 3 deaths across four states (Arizona, Kansas, Oklahoma, and Texas). Listeria strains were identified in ice cream products from Blue Bell's Brenham, Texas facility, and WGS linked patients to isolates collected from the production environment. Investigation revealed contamination had been present in the facility for years, with WGS retrospectively linking a 2014 hospital cluster in Kansas to the same environmental strain. Blue Bell voluntarily recalled all products and temporarily shut all three production facilities; at the time, it was the largest ice cream recall in US history.

The 2018–2019 E. coli O157:H7 outbreaks in romaine lettuce involved multiple separate outbreak events — including a Thanksgiving 2019 outbreak ultimately traced to a Salinas, California growing region — with a combined toll exceeding 210 cases, 96 hospitalizations, and 5 deaths. WGS was central to linking cases across states and distinguishing the outbreak strain from background E. coli O157 cases. The outbreaks led FDA to negotiate voluntary enhanced water testing requirements for the Salinas Valley and Yuma, Arizona growing regions and prompted the California Leafy Greens Marketing Agreement to strengthen water quality standards.

The 2013 Foster Farms Salmonella Heidelberg outbreak involved approximately 600 ill persons and was notable for the USDA's decision not to require a product recall despite the extended outbreak. At the time, the USDA's FSIS regulatory framework did not classify Salmonella as an adulterant in raw poultry (unlike E. coli O157:H7 in ground beef), limiting the agency's authority to compel recall of products that tested positive. The outbreak accelerated regulatory discussions that eventually resulted in updated USDA Salmonella performance standards for poultry facilities.

Outbreak investigation methods

A foodborne outbreak investigation combines epidemiological analysis with laboratory testing and product traceback. The initial step is hypothesis generation: epidemiologists interview case patients using standardized food frequency questionnaires (FFQs) covering consumption of dozens of food items in the days before illness onset. Spot maps visualize case geography; attack rate calculations identify which foods (or settings) have the highest proportion of ill persons among those exposed.
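As a rough illustration of the attack rate step, the sketch below ranks foods by the ratio of attack rates among those who ate an item versus those who did not. The `FoodExposure` class and every count in it are hypothetical, invented for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FoodExposure:
    """Interview tallies for one food item (all counts are hypothetical)."""
    food: str
    ate_ill: int
    ate_total: int
    did_not_eat_ill: int
    did_not_eat_total: int

    @property
    def attack_rate_exposed(self) -> float:
        # Proportion ill among people who ate the item.
        return self.ate_ill / self.ate_total

    @property
    def attack_rate_unexposed(self) -> float:
        # Proportion ill among people who did not eat the item.
        return self.did_not_eat_ill / self.did_not_eat_total

    @property
    def risk_ratio(self) -> float:
        # Undefined if nobody in the unexposed group fell ill.
        return self.attack_rate_exposed / self.attack_rate_unexposed


exposures = [
    FoodExposure("potato salad", ate_ill=36, ate_total=45,
                 did_not_eat_ill=4, did_not_eat_total=40),
    FoodExposure("roast beef", ate_ill=30, ate_total=60,
                 did_not_eat_ill=10, did_not_eat_total=25),
    FoodExposure("iced tea", ate_ill=22, ate_total=50,
                 did_not_eat_ill=18, did_not_eat_total=35),
]

ranked = sorted(exposures, key=lambda e: e.risk_ratio, reverse=True)
for e in ranked:
    print(f"{e.food:<14} exposed {e.attack_rate_exposed:.0%}  "
          f"unexposed {e.attack_rate_unexposed:.0%}  RR {e.risk_ratio:.2f}")
```

A food with a high attack rate among eaters and a low one among non-eaters rises to the top of the list, which is the pattern investigators look for before moving to a formal analytical study.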

Analytical epidemiology — case-control or cohort studies — provides statistical evidence of association between a specific food and illness. A case-control study compares food exposures reported by ill persons (cases) to those of comparable well persons (controls) and calculates odds ratios. An implicated food typically shows an odds ratio substantially greater than 1.0, with a confidence interval that excludes the null. Epidemiological evidence alone is often insufficient for regulatory action; laboratory confirmation linking the outbreak pathogen strain to the implicated food or production environment is typically required before a recall is issued.
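The odds ratio and its confidence interval can be sketched in a few lines. This example uses the standard log-based (Woolf) approximation for the 95 percent interval; the 2x2 counts are hypothetical, and real investigations often use exact or adjusted methods when cell counts are small.

```python
import math


def odds_ratio_ci(cases_exposed: int, cases_unexposed: int,
                  controls_exposed: int, controls_unexposed: int,
                  z: float = 1.96) -> tuple[float, float, float]:
    """Return (odds ratio, lower bound, upper bound) for a 2x2 table.

    Uses the Woolf (log) method; z=1.96 gives an approximate 95% interval.
    """
    cells = (cases_exposed, cases_unexposed, controls_exposed, controls_unexposed)
    if any(c <= 0 for c in cells):
        raise ValueError("all four cells must be positive for this method")
    a, b, c, d = cells
    odds_ratio = (a * d) / (b * c)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Hypothetical study: 30 of 40 cases and 12 of 40 controls ate the food.
or_value, ci_low, ci_high = odds_ratio_ci(30, 10, 12, 28)
print(f"OR = {or_value:.2f}, 95% CI {ci_low:.2f} to {ci_high:.2f}")
excludes_null = ci_low > 1.0 or ci_high < 1.0
print("Interval excludes 1.0:", excludes_null)
```

Here the interval sits entirely above 1.0, which is the pattern the text describes for an implicated food.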

Whole genome sequencing has transformed the traceback phase of outbreak investigation. The PulseNet network — a CDC-coordinated system of public health and food regulatory laboratories across all 50 states and several federal agencies — sequences bacterial isolates from clinical cases and from food and environmental samples. WGS data identifies genetic clusters: when clinical isolates from patients in multiple states are genetically near-identical (differing by zero to five single nucleotide polymorphisms in a genome of several million bases), they almost certainly share a common source even if the patients have no known contact with each other. PulseNet cluster detection signals have become the primary trigger for multistate outbreak investigations, allowing investigators to identify outbreaks that traditional epidemiological surveillance would not recognize because cases are too dispersed to appear as a local cluster.
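The clustering idea can be illustrated with a toy example: count SNP differences between pre-aligned sequences and group isolates that fall within a threshold. This is only a sketch under strong assumptions. The isolate names and sequences are invented, the 40-base strings stand in for multi-megabase genomes, and production pipelines rely on dedicated alignment and allele-calling tools rather than string comparison.

```python
from itertools import combinations


def snp_distance(a: str, b: str) -> int:
    """Count differing positions between two aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(1 for x, y in zip(a, b) if x != y)


def cluster_isolates(isolates: dict[str, str], max_snps: int = 5) -> list[set[str]]:
    """Single-linkage clustering: link any pair within max_snps of each other."""
    parent = {name: name for name in isolates}

    def find(name: str) -> str:
        while parent[name] != name:
            parent[name] = parent[parent[name]]  # path compression
            name = parent[name]
        return name

    for a, b in combinations(isolates, 2):
        if snp_distance(isolates[a], isolates[b]) <= max_snps:
            parent[find(a)] = find(b)

    clusters: dict[str, set[str]] = {}
    for name in isolates:
        clusters.setdefault(find(name), set()).add(name)
    return list(clusters.values())


reference = "ACGT" * 10
isolates = {
    "CO_01": reference,                       # matches the reference
    "TX_07": "TT" + reference[2:],            # 2 SNPs from CO_01
    "NY_03": "G" * 20 + reference[20:],       # 15 SNPs from CO_01
}
clusters = cluster_isolates(isolates, max_snps=5)
print(sorted(sorted(c) for c in clusters))
```

The Colorado and Texas isolates land in one cluster despite having no geographic link, while the New York isolate stays separate, mirroring how dispersed cases get tied to a common source.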

Environmental sampling complements laboratory case analysis. When a production facility or growing region is implicated, FDA, USDA, and state regulatory agencies collect environmental samples from equipment surfaces, water sources, soil, and finished product. Matching WGS profiles between environmental and clinical isolates provides the definitive microbiological evidence linking a production source to a human outbreak. The GenomeTrakr network, coordinated by FDA, maintains a public repository of bacterial whole genome sequences from food and environmental sources — a resource investigators can query to identify whether an outbreak strain has previously been found in a specific facility or geographic region.

FDA Food Safety Modernization Act

The FDA Food Safety Modernization Act (FSMA), signed into law in January 2011, represents the most significant reform of the federal food safety regulatory framework since 1938. The core FSMA philosophy is a shift from response — recalling contaminated products after illness occurs — to prevention — requiring food facilities to systematically identify and control hazards before food reaches consumers.

The Preventive Controls for Human Food rule (21 CFR Part 117), finalized in 2015, requires all registered food facilities to maintain written food safety plans that include a hazard analysis (identifying biological, chemical, and physical hazards associated with each food and process), preventive controls (process controls, food allergen controls, sanitation controls, and supply-chain controls), monitoring procedures, corrective action procedures, and verification activities. The rule applies to facilities ranging from large processors to smaller manufacturers and introduced a risk-based compliance approach: facilities processing foods with higher hazard potential face more stringent requirements.

The Produce Safety Rule (21 CFR Part 112) establishes science-based standards for the safe growing, harvesting, packing, and holding of fruits and vegetables consumed raw. Key provisions address agricultural water quality — requiring farms to test irrigation water for generic E. coli as an indicator of fecal contamination — worker health and hygiene, equipment and tool sanitation, and protection from domesticated and wild animal intrusion. The water quality standards have been among the most contested aspects of implementation, with the FDA revising the testing methodology and compliance timelines multiple times following scientific and industry feedback.

FSMA Section 204 establishes enhanced traceability record-keeping requirements for high-risk foods identified on the FDA's Food Traceability List — a category that includes leafy greens, fresh tomatoes, nut butters, shell eggs, and several other produce items with documented outbreak histories. Covered firms must maintain key data elements (KDEs) at critical tracking events (CTEs) — harvesting, cooling, packing, shipping, receiving — in a standardized format that enables FDA to trace a contaminated product from consumer to farm in hours rather than days. The traceability rule represents a direct regulatory response to the operational lessons of major produce outbreak investigations, which repeatedly demonstrated that inadequate recordkeeping extended the time needed to identify and remove contaminated product from commerce.
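To make the KDE/CTE idea concrete, here is a deliberately simplified sketch of tracing one lot through its recorded events. The `TrackingEvent` fields, lot codes, and locations are all hypothetical and do not reproduce the actual FDA record schema; in particular, the sketch assumes a single lot code persists along the chain, whereas real supply chains may assign new codes when product is transformed.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class TrackingEvent:
    """One critical tracking event with a few illustrative data elements."""
    event_type: str      # e.g. "harvesting", "packing", "shipping", "receiving"
    lot_code: str        # hypothetical lot identifier
    location: str
    event_date: date


def trace_lot(events: list[TrackingEvent], lot_code: str) -> list[TrackingEvent]:
    """Return all events for one lot, ordered from earliest to latest."""
    matching = [e for e in events if e.lot_code == lot_code]
    return sorted(matching, key=lambda e: e.event_date)


events = [
    TrackingEvent("receiving", "LOT-0042", "Example Retail DC", date(2024, 6, 6)),
    TrackingEvent("harvesting", "LOT-0042", "Example Farm, Field 7", date(2024, 6, 1)),
    TrackingEvent("shipping", "LOT-0042", "Example Packing Co.", date(2024, 6, 4)),
    TrackingEvent("packing", "LOT-0042", "Example Packing Co.", date(2024, 6, 2)),
    TrackingEvent("harvesting", "LOT-0099", "Other Farm, Field 2", date(2024, 6, 1)),
]

history = trace_lot(events, "LOT-0042")
for e in history:
    print(f"{e.event_date}  {e.event_type:<11} {e.location}")
```

When every firm in the chain keeps records keyed to the same identifier, walking from the point of sale back to the field becomes a lookup rather than a paper chase.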

Data access

FDOSS outbreak data is publicly available through several channels. The primary download portal is at cdc.gov/fdoss/data/index.html, where CDC publishes Excel spreadsheets of outbreak data by year and by state, as well as SAS-format analytical datasets for researchers. Annual data typically becomes available 12–18 months after the reference year, reflecting the time required for states to complete outbreak investigations and submit final reports.

Programmatic access is available through the Socrata API on data.cdc.gov at dataset ID 9c22-jgdb. The endpoint accepts standard Socrata SoQL parameters for filtering, aggregation, and pagination. Key FDOSS fields available in the Socrata dataset include year, state, etiology (causative pathogen, confirmed or suspected), serotype (for pathogens where serotyping applies), food_category (beef, dairy, chicken, leafy vegetables, fruits, fish, etc.), setting_type (restaurant, private home, catering, institution, school, etc.), illnesses, hospitalizations, deaths, and primary_mode (the contamination factor: improper holding temperature, infected food handler, inadequate cooking, contaminated equipment, etc.).

The CDC's FoodNet (Foodborne Diseases Active Surveillance Network) provides a complementary active surveillance dataset at cdc.gov/foodnet/. Unlike FDOSS, which relies on passive outbreak reporting, FoodNet conducts population-based surveillance in 10 sentinel sites across the US — accounting for roughly 15 percent of the US population — and actively tracks laboratory-confirmed cases of nine foodborne pathogens regardless of outbreak linkage. FoodNet data enables CDC to calculate incidence rates and estimate the proportion of foodborne illness that is actually captured in outbreak surveillance. FDA food enforcement actions — including recall notices — are published at the FDA's enforcement report portal at accessdata.fda.gov/scripts/enforcement/enforce_rpt-Product-Tabs.cfm.

Python analysis: FDOSS outbreak patterns

The FDOSS Socrata endpoint enables structured queries across the full historical dataset. Because FDOSS contains outbreak-level records rather than case-level rows, the dataset is far smaller than case surveillance systems like NNDSS (on the order of 800 to 1,000 records per year, in line with the annual outbreak counts cited earlier), making full-dataset retrieval feasible without aggressive server-side aggregation. The following script computes five analytical summaries: annual outbreak counts by pathogen (top eight), illness burden by food category, case fatality and hospitalization rates by pathogen, outbreak setting distribution, and states with the highest per-capita outbreak counts.

import requests
import pandas as pd

# -------------------------------------------------------
# CDC Foodborne Disease Outbreak Surveillance System (FDOSS)
# Analysis via data.cdc.gov Socrata API
#
# This script:
#   1. Fetches FDOSS outbreak data (dataset 9c22-jgdb)
#   2. Annual outbreak counts by pathogen (top 8, 2010-2022)
#   3. Illness counts by food category
#   4. Case fatality rate by pathogen (Listeria / Salmonella /
#      E. coli)
#   5. Outbreak setting distribution (restaurant / home /
#      caterer / institution)
#   6. States with most outbreaks per capita
# -------------------------------------------------------

FDOSS_ENDPOINT = "https://data.cdc.gov/resource/9c22-jgdb.json"

# Census 2022 state population estimates (50 states; DC and territories omitted)
CENSUS_POP_2022 = {
    "Alabama": 5074296,      "Alaska": 733583,        "Arizona": 7359197,
    "Arkansas": 3045637,     "California": 39029342,  "Colorado": 5839926,
    "Connecticut": 3626205,  "Delaware": 1018396,     "Florida": 22244823,
    "Georgia": 10912876,     "Hawaii": 1440196,       "Idaho": 1939033,
    "Illinois": 12582032,    "Indiana": 6833037,      "Iowa": 3200517,
    "Kansas": 2937150,       "Kentucky": 4512310,     "Louisiana": 4590241,
    "Maine": 1385340,        "Maryland": 6164661,     "Massachusetts": 6981974,
    "Michigan": 10034113,    "Minnesota": 5717184,    "Mississippi": 2940057,
    "Missouri": 6177957,     "Montana": 1122867,      "Nebraska": 1967923,
    "Nevada": 3177772,       "New Hampshire": 1395231,"New Jersey": 9261699,
    "New Mexico": 2113344,   "New York": 19677151,    "North Carolina": 10698973,
    "North Dakota": 779261,  "Ohio": 11756058,        "Oklahoma": 4019800,
    "Oregon": 4240137,       "Pennsylvania": 12972008,"Rhode Island": 1093734,
    "South Carolina": 5282634,"South Dakota": 909824, "Tennessee": 7051339,
    "Texas": 30029572,       "Utah": 3380800,         "Vermont": 647464,
    "Virginia": 8683619,     "Washington": 7785786,   "West Virginia": 1775156,
    "Wisconsin": 5892539,    "Wyoming": 581381,
}


def fetch_fdoss(params: dict, page_size: int = 5000) -> list[dict]:
    """Paginate through CDC FDOSS Socrata endpoint."""
    records = []
    offset = 0
    while True:
        paged = {**params, "$limit": page_size, "$offset": offset}
        resp = requests.get(FDOSS_ENDPOINT, params=paged, timeout=60)
        resp.raise_for_status()
        batch = resp.json()
        if not batch:
            break
        records.extend(batch)
        print(f"  Fetched {len(records):,} records so far...")
        if len(batch) < page_size:
            break
        offset += page_size
    return records


def to_int(val, default: int = 0) -> int:
    try:
        return int(float(val))
    except (TypeError, ValueError):
        return default


# -------------------------------------------------------
# Step 1: Annual outbreak counts by pathogen (top 8),
#         2010-2022
# -------------------------------------------------------
print("Fetching FDOSS outbreak records 2010-2022...")

raw = fetch_fdoss({
    "$select": "year, etiology, COUNT(*) AS outbreaks",
    "$where": "year >= 2010 AND year <= 2022 AND etiology IS NOT NULL",
    "$group": "year, etiology",
    "$order": "year ASC, etiology ASC",
})

df = pd.DataFrame(raw)
df["year"] = pd.to_numeric(df["year"], errors="coerce")
df["outbreaks"] = pd.to_numeric(df["outbreaks"], errors="coerce").fillna(0)

# Identify top 8 pathogens by total outbreak count
pathogen_totals = (
    df.groupby("etiology")["outbreaks"]
    .sum()
    .sort_values(ascending=False)
)
top8 = pathogen_totals.head(8).index.tolist()

print("\nTop 8 Foodborne Outbreak Pathogens by Total Outbreaks (2010-2022)")
print("-" * 60)
print(f"{'Pathogen':<35} {'Outbreaks':<10}")
print("-" * 60)
for p in top8:
    print(f"{p:<35} {int(pathogen_totals[p]):>10,}")

# Year-by-year for top 8
top8_df = df[df["etiology"].isin(top8)].copy()
pivot = top8_df.pivot_table(
    index="etiology", columns="year", values="outbreaks", aggfunc="sum"
).fillna(0).astype(int)
pivot = pivot.reindex(top8)

print("\nAnnual Outbreak Count by Pathogen (top 8)")
print("-" * 100)
years = sorted(top8_df["year"].dropna().unique().astype(int).tolist())
print(f"{'Pathogen':<35}", end="")
for y in years:
    print(f"  {y:>4}", end="")
print()
print("-" * 100)
for p in top8:
    print(f"{p:<35}", end="")
    for y in years:
        val = pivot.loc[p, y] if y in pivot.columns else 0
        print(f"  {int(val):>4}", end="")
    print()


# -------------------------------------------------------
# Step 2: Illness counts by food category
# -------------------------------------------------------
print("\nFetching illness counts by food category (2010-2022)...")

# No "$limit" here: fetch_fdoss sets it for pagination, and the top 15
# categories are selected client-side below.
food_raw = fetch_fdoss({
    "$select": "food_category, SUM(illnesses) AS total_illnesses, COUNT(*) AS outbreaks",
    "$where": "year >= 2010 AND year <= 2022 AND food_category IS NOT NULL AND illnesses IS NOT NULL",
    "$group": "food_category",
    "$order": "total_illnesses DESC",
})

food_df = pd.DataFrame(food_raw)
food_df["total_illnesses"] = pd.to_numeric(food_df["total_illnesses"], errors="coerce").fillna(0)
food_df["outbreaks"] = pd.to_numeric(food_df["outbreaks"], errors="coerce").fillna(0)
food_df["illnesses_per_outbreak"] = (food_df["total_illnesses"] / food_df["outbreaks"]).round(1)
food_df = food_df.sort_values("total_illnesses", ascending=False).head(15).reset_index(drop=True)
food_df.index += 1

print("\nFoodborne Illnesses by Food Category (2010-2022, top 15)")
print("-" * 75)
print(f"{'Food Category':<30} {'Total Illnesses':<18} {'Outbreaks':<12} {'Ill/Outbreak':<12}")
print("-" * 75)
for rank, row in food_df.iterrows():
    print(
        f"{str(row['food_category']):<30} "
        f"{int(row['total_illnesses']):>18,} "
        f"{int(row['outbreaks']):>12,} "
        f"{row['illnesses_per_outbreak']:>12.1f}"
    )


# -------------------------------------------------------
# Step 3: Case fatality rate by pathogen
#         Focus: Listeria, Salmonella, E. coli
# -------------------------------------------------------
print("\nFetching outcomes data for CFR comparison by pathogen...")

cfr_raw = fetch_fdoss({
    "$select": "etiology, SUM(illnesses) AS illnesses, SUM(hospitalizations) AS hospitalizations, SUM(deaths) AS deaths, COUNT(*) AS outbreaks",
    "$where": (
        "year >= 2010 AND year <= 2022 "
        "AND illnesses IS NOT NULL "
        "AND (etiology LIKE '%Listeria%' OR etiology LIKE '%Salmonella%' OR etiology LIKE '%Escherichia%' OR etiology LIKE '%E. coli%')"
    ),
    "$group": "etiology",
    "$order": "illnesses DESC",
})

cfr_df = pd.DataFrame(cfr_raw)
for col in ["illnesses", "hospitalizations", "deaths", "outbreaks"]:
    cfr_df[col] = pd.to_numeric(cfr_df[col], errors="coerce").fillna(0)
cfr_df["hosp_rate_pct"] = ((cfr_df["hospitalizations"] / cfr_df["illnesses"]) * 100).round(2)
cfr_df["cfr_pct"] = ((cfr_df["deaths"] / cfr_df["illnesses"]) * 100).round(3)
cfr_df = cfr_df.sort_values("illnesses", ascending=False).reset_index(drop=True)

print("\nCase Fatality & Hospitalization Rates by Pathogen (2010-2022)")
print("-" * 85)
print(f"{'Pathogen':<35} {'Illnesses':<12} {'Hosp%':<10} {'CFR%':<10} {'Deaths':<8}")
print("-" * 85)
for _, row in cfr_df.iterrows():
    print(
        f"{str(row['etiology']):<35} "
        f"{int(row['illnesses']):>12,} "
        f"{row['hosp_rate_pct']:>10.2f} "
        f"{row['cfr_pct']:>10.3f} "
        f"{int(row['deaths']):>8,}"
    )


# -------------------------------------------------------
# Step 4: Outbreak setting distribution
# -------------------------------------------------------
print("\nFetching outbreak setting distribution (2010-2022)...")

setting_raw = fetch_fdoss({
    "$select": "setting_type, COUNT(*) AS outbreaks, SUM(illnesses) AS illnesses",
    "$where": "year >= 2010 AND year <= 2022 AND setting_type IS NOT NULL",
    "$group": "setting_type",
    "$order": "outbreaks DESC",
})

setting_df = pd.DataFrame(setting_raw)
setting_df["outbreaks"] = pd.to_numeric(setting_df["outbreaks"], errors="coerce").fillna(0)
setting_df["illnesses"] = pd.to_numeric(setting_df["illnesses"], errors="coerce").fillna(0)
total_outbreaks = setting_df["outbreaks"].sum()
setting_df["pct_outbreaks"] = ((setting_df["outbreaks"] / total_outbreaks) * 100).round(1)
setting_df = setting_df.sort_values("outbreaks", ascending=False).head(10).reset_index(drop=True)
setting_df.index += 1

print("\nFoodborne Outbreak Setting Distribution (2010-2022, top 10)")
print("-" * 70)
print(f"{'Setting':<35} {'Outbreaks':<12} {'%':<8} {'Illnesses':<10}")
print("-" * 70)
for rank, row in setting_df.iterrows():
    print(
        f"{str(row['setting_type']):<35} "
        f"{int(row['outbreaks']):>12,} "
        f"{row['pct_outbreaks']:>8.1f} "
        f"{int(row['illnesses']):>10,}"
    )


# -------------------------------------------------------
# Step 5: States with most outbreaks per capita (2010-2022)
# -------------------------------------------------------
print("\nFetching state-level outbreak counts (2010-2022)...")

state_raw = fetch_fdoss({
    "$select": "state, COUNT(*) AS outbreaks, SUM(illnesses) AS illnesses",
    "$where": "year >= 2010 AND year <= 2022 AND state IS NOT NULL",
    "$group": "state",
    "$order": "outbreaks DESC",
})

state_df = pd.DataFrame(state_raw)
state_df["outbreaks"] = pd.to_numeric(state_df["outbreaks"], errors="coerce").fillna(0)
state_df["illnesses"] = pd.to_numeric(state_df["illnesses"], errors="coerce").fillna(0)
state_df["pop"] = state_df["state"].map(CENSUS_POP_2022)
state_df = state_df.dropna(subset=["pop"])
state_df["outbreaks_per_1m"] = ((state_df["outbreaks"] / state_df["pop"]) * 1_000_000).round(2)
ranked = state_df.sort_values("outbreaks_per_1m", ascending=False).head(15).reset_index(drop=True)
ranked.index += 1

print("\nTop 15 States: Foodborne Outbreaks per Million Population (2010-2022)")
print("-" * 70)
print(f"{'Rank':<6} {'State':<20} {'Outbreaks':<12} {'Illnesses':<12} {'Per 1M Pop':<12}")
print("-" * 70)
for rank, row in ranked.iterrows():
    print(
        f"{rank:<6} {str(row['state']):<20} "
        f"{int(row['outbreaks']):>12,} "
        f"{int(row['illnesses']):>12,} "
        f"{row['outbreaks_per_1m']:>12.2f}"
    )

A practical note on FDOSS field values: the etiology field contains a wide range of string values including confirmed pathogens, pathogen groups (e.g., “Salmonella, nontyphoidal”), suspected agents, and multi-etiology entries for outbreaks with more than one contributing pathogen. For clean pathogen analysis, filtering to confirmed etiology records and normalizing pathogen names before grouping produces more interpretable results than using raw field values directly. Similarly, food_category values reflect the FDOSS classification system, which has evolved over the dataset's history; mapping historical categories to a consistent modern taxonomy improves longitudinal comparability.

The per-capita outbreak ranking should be interpreted carefully. States with more active public health investigation capacity — larger local health departments, more outbreak investigation funding, stronger laboratory infrastructure — will report more outbreaks per capita not because they have more foodborne illness but because they are better at detecting and investigating it. Reporting intensity is an artifact of surveillance capacity as much as disease burden. FDOSS counts are most reliable as indicators of relative trends within a state over time and as signals of the pathogens and food categories of greatest public health concern nationally.