CDC NNDSS: The Federal Database Behind Reportable Disease Surveillance in the United States
The National Notifiable Diseases Surveillance System aggregates case reports from all 50 states and territories for 120+ nationally notifiable diseases — from salmonellosis and Lyme disease to HIV, hepatitis, measles, and emerging threats — providing the epidemiological backbone of US disease outbreak response.
When a clinician in rural Maine diagnoses a patient with Lyme disease, that case eventually becomes a data point at the CDC in Atlanta. The path is neither fast nor complete — the report moves from provider to county health department to state health department to CDC, each step introducing potential delay and loss — but the accumulated flow of millions of such reports each year constitutes the primary epidemiological record of infectious disease burden in the United States. That record is the National Notifiable Diseases Surveillance System.
What NNDSS is
NNDSS is the federal infrastructure through which CDC receives case-level or aggregate reports of nationally notifiable conditions from state and territorial health departments. The system's scope is defined jointly by the CDC and the Council of State and Territorial Epidemiologists (CSTE), a nongovernmental professional organization of public health practitioners from all 50 states, five US territories, and two freely associated states. CSTE proposes additions, removals, and case definition updates annually; CDC adopts these as the official federal notifiable condition list. As of the most recent annual review, there are more than 120 nationally notifiable conditions — a list that has grown substantially since NNDSS's predecessor systems began in the 1950s, reflecting both the emergence of new pathogens and the expansion of laboratory diagnostic capability.
The reporting chain runs in two stages. In the first, healthcare providers and clinical laboratories are legally required by state law to report certain conditions to local or state health departments; disease reporting mandates rest on state authority, not federal law. Electronic lab reporting (ELR) systems now automate much of this flow for laboratory-confirmed diagnoses: positive test results transmit directly from laboratory information management systems to state health department surveillance platforms without manual transcription. In the second stage, states forward case reports to CDC electronically via one of two systems: the legacy NETSS (National Electronic Telecommunications System for Surveillance), which transmits aggregate counts in a flat ASCII format developed in the 1980s, or the modern NNDSS messaging system using HL7 case report messages in XML format with standardized case investigation data elements.
Reporting completeness is a persistent and significant limitation. Not all states report all conditions consistently. Case definitions — the clinical and laboratory criteria that must be met for a case to be counted — are standardized nationally by CSTE, but testing rates, diagnostic coding practices, and public health department staffing vary enormously across jurisdictions. A condition that is well-captured in a state with robust public health infrastructure and high clinical testing rates may be dramatically undercounted in a resource-constrained state where testing is less common and case investigation capacity is limited. The result is that NNDSS counts are best understood as surveillance data — trend indicators and relative comparisons — rather than true population incidence.
Disease categories and annual case volumes
The 120+ notifiable conditions span a wide range of pathogen types and transmission routes. The approximate annual case volumes below reflect recent reporting years and illustrate the enormous range in disease burden captured by the system.
Foodborne and enteric diseases
Salmonellosis is among the most commonly reported conditions, with roughly 60,000 to 65,000 confirmed cases reported to NNDSS annually. The CDC estimates that only about 1 in 29 actual Salmonella infections is captured in the surveillance system — meaning the true annual burden approaches 1.35 million infections. Most cases are mild and self-limiting; patients recover without seeking care, and those who do seek care often do not have stool cultures ordered. Campylobacteriosis follows a similar pattern, with approximately 25,000 reported cases per year against a CDC-estimated true incidence of 1.5 million.
Shiga toxin-producing Escherichia coli (STEC) infections, which can cause hemolytic uremic syndrome and kidney failure in children, generate 8,000 to 10,000 reports annually. Listeriosis, though far less common at roughly 900 cases per year, carries high mortality (roughly 20 percent) and disproportionately affects pregnant women, neonates, adults over 65, and the immunocompromised, which makes it a high-priority surveillance target despite low total case counts. Other enteric conditions in NNDSS include cyclosporiasis, cryptosporidiosis, giardiasis, and vibriosis.
Vector-borne diseases
Lyme disease is the most reported vector-borne condition in the United States, with 30,000 to 60,000 confirmed and probable cases reported to NNDSS in recent years. CDC surveillance modeling estimates the actual annual incidence at approximately 476,000 infections, roughly 8 to 16 times the reported counts. The undercount reflects limited testing in early disease (when symptoms are nonspecific), two-tier serological testing that misses some early infections, and the fact that a substantial fraction of Lyme disease is diagnosed and treated clinically without the confirmatory laboratory testing that would qualify it for NNDSS reporting. Geographic concentration is extreme: roughly 95 percent of cases occur in 14 states in the Northeast and upper Midwest, corresponding to the range of the blacklegged tick, Ixodes scapularis. Warming temperatures have pushed the geographic distribution of reported cases northward and into previously low-incidence areas over the past two decades.
West Nile virus produces 2,000 to 3,000 symptomatic neuroinvasive and non-neuroinvasive cases per year in the NNDSS data, with year-to-year variation driven primarily by mosquito season conditions. Rocky Mountain spotted fever (RMSF), caused by Rickettsia rickettsii transmitted by the American dog tick, generates several thousand reports annually; despite its name, the current geographic concentration is in the South Atlantic and South Central states rather than the Rocky Mountain region. Ehrlichiosis and Anaplasmosis, tick-borne bacterial infections, each contribute several thousand cases per year and have increased substantially as tick populations expand their range.
Vaccine-preventable diseases
Measles surveillance is among the most closely watched indicators in the NNDSS system because measles elimination — declared achieved in the US in 2000 — requires sustained vaccination coverage above roughly 95 percent to maintain herd immunity. Reported cases typically number well under 1,000 per year in non-outbreak years, but the system is exquisitely sensitive to vaccination coverage gaps. In 2019, the US recorded 1,282 measles cases — the highest since 1992 — driven primarily by outbreaks in Orthodox Jewish communities in New York City and Rockland County where vaccine hesitancy had reduced coverage below elimination thresholds. In 2024, 285+ cases were recorded, with clusters in Texas and Florida connected to international importations into under-vaccinated communities. Each outbreak is a diagnostic of vaccine coverage failure at the community level.
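The roughly 95 percent figure can be related to the standard herd immunity threshold, 1 - 1/R0. The sketch below is illustrative only: it assumes the commonly cited measles R0 range of 12 to 18, which is not stated in the text above, and the `vaccine_effectiveness` parameter is a hypothetical knob rather than a CDC figure.

```python
def herd_immunity_threshold(r0: float) -> float:
    """Fraction of the population that must be immune to block sustained spread."""
    if r0 <= 1:
        return 0.0  # an R0 at or below 1 cannot sustain transmission
    return 1.0 - 1.0 / r0


def required_coverage(r0: float, vaccine_effectiveness: float = 1.0) -> float:
    """Coverage needed when only a fraction of vaccinated people become immune."""
    return herd_immunity_threshold(r0) / vaccine_effectiveness


# Assumed R0 range for measles (an assumption, not taken from the text above).
for r0 in (12, 18):
    print(f"R0={r0}: immune fraction needed ~ {herd_immunity_threshold(r0):.1%}")
```

Because the threshold refers to the immune fraction rather than the vaccinated fraction, any vaccine effectiveness below 100 percent pushes the required coverage higher, which is one way to read the practical target of about 95 percent.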
Pertussis (whooping cough) presents a different challenge: 20,000 to 30,000 cases are reported annually despite high childhood vaccination coverage, reflecting the waning immunity profile of the acellular pertussis vaccines introduced in the 1990s. DTaP and Tdap vaccines provide strong protection for several years but less durable immunity than the older whole-cell vaccines they replaced; adolescents and adults who received childhood vaccination are increasingly susceptible as immunity wanes. Mumps and rubella complete the vaccine-preventable set in NNDSS, with mumps occasionally producing multi-hundred-case outbreaks in college settings where close contact facilitates transmission.
Sexually transmitted and bloodborne infections
The STI burden in NNDSS data reflects a sustained national crisis. Chlamydia is the most commonly reported infectious condition in the entire system at approximately 1.6 million cases annually. Because chlamydia is frequently asymptomatic, the true incidence is substantially higher. Gonorrhea generates approximately 670,000 annual reports; the emergence of multidrug-resistant Neisseria gonorrhoeae strains has made treatment increasingly challenging, with the CDC updating gonorrhea treatment guidelines multiple times as successive first-line antibiotics have lost efficacy.
Syphilis reached a 30-year high in 2022 with approximately 207,000 cases across all stages reported to NNDSS. Congenital syphilis — transmitted from mother to infant during pregnancy — has increased at an alarming rate, from 334 cases in 2012 to over 3,700 cases in 2022, a greater than tenfold increase. Infants with congenital syphilis face serious long-term consequences including neurological damage, bone abnormalities, and death; the resurgence reflects gaps in prenatal care access and STI testing infrastructure in affected communities.
HIV surveillance through NNDSS records approximately 36,000 new diagnoses per year. NNDSS HIV reporting is among the most complete in the system — HIV is a high-priority condition with well-funded surveillance infrastructure — but new diagnoses represent only those identified through testing; an estimated 13 percent of people living with HIV in the US are unaware of their infection. Hepatitis B and hepatitis C surveillance captures approximately 20,000 and 66,000 new acute cases per year respectively, though chronic viral hepatitis is far more prevalent than acute case counts suggest.
Respiratory and other bacterial diseases
Tuberculosis remains nationally notifiable with 8,300 to 9,600 cases reported annually. The US TB rate is among the lowest in the world for a large country, but TB elimination remains elusive; foreign-born individuals account for approximately 70 percent of US cases. Drug-resistant TB — multidrug-resistant (MDR-TB) and extensively drug-resistant (XDR-TB) — requires much longer and more toxic treatment regimens and receives heightened surveillance attention. Legionellosis (Legionnaires' disease) generates approximately 10,000 reported cases annually and has been increasing steadily, likely reflecting both improved diagnostic testing and aging water infrastructure. Meningococcal disease, while rare at several hundred cases per year, warrants close surveillance due to its high case fatality rate and the availability of effective vaccines.
Emerging and priority conditions
NNDSS demonstrated its capacity to absorb novel pathogens during the 2022 mpox outbreak, when the condition was added as an emergency notifiable disease as over 31,000 US cases were identified within months — the largest mpox outbreak ever recorded outside endemic African regions. Transmission occurred predominantly through close sexual contact in networks of men who have sex with men, and the outbreak was substantially controlled through a combination of vaccination, behavioral change, and community outreach.
COVID-19 was declared a public health emergency in January 2020 and added to NNDSS as an emergency notifiable condition. Deidentified case-level COVID-19 surveillance data covering 2020–2023 was published at data.cdc.gov, growing to more than 100 million records. Following the end of the federal public health emergency in May 2023, COVID-19 individual case reporting to CDC ended; surveillance shifted to aggregate sentinel systems including laboratory test positivity, wastewater detection through the National Wastewater Surveillance System, and emergency department syndromic surveillance. If an Ebola case were imported to the US, it would trigger immediate emergency NNDSS designation and reporting through the same rapid-response infrastructure.
Case investigation and surveillance infrastructure
NNDSS case counts depend on a multi-layer infrastructure that extends far beyond the NNDSS reporting pipeline itself. Case definitions — the clinical, laboratory, and epidemiological criteria that define a reportable case — are developed by CSTE working groups and specify confirmed, probable, and suspect classifications. A confirmed Lyme disease case requires both compatible clinical manifestations and specific laboratory evidence; a probable case may rely on exposure history and clinical presentation without laboratory confirmation. These definitional tiers affect comparability across time (case definitions are updated annually) and across jurisdictions (states vary in how diligently they investigate possible cases).
Electronic lab reporting has substantially accelerated and automated case detection for laboratory-diagnosed conditions. When a clinical laboratory detects a positive STEC culture, the laboratory information management system transmits an ELR message directly to the state health department. State systems match the ELR message against existing case investigations, create new cases, and route them to local health departments for follow-up. The transition to ELR has reduced both the lag from diagnosis to case report and the burden on provider office staff who previously completed paper or phone reports manually.
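The matching step described above can be sketched roughly as follows. The record shapes (`ElrMessage`, `CaseInvestigation`), the 30-day matching window, and the exact-match rule on a patient identifier are all hypothetical simplifications; real state surveillance platforms use their own schemas and more forgiving patient matching.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class ElrMessage:
    patient_id: str      # hypothetical identifier; real systems match on demographics
    condition: str
    result_date: date


@dataclass
class CaseInvestigation:
    patient_id: str
    condition: str
    opened: date
    lab_results: list = field(default_factory=list)


def route_elr(msg, open_cases, window_days=30):
    """Attach msg to a matching open case, or open a new case.

    Returns (case, created) where created is True if a new case was opened.
    """
    for case in open_cases:
        same_person = case.patient_id == msg.patient_id
        same_condition = case.condition == msg.condition
        recent = abs((msg.result_date - case.opened).days) <= window_days
        if same_person and same_condition and recent:
            case.lab_results.append(msg)
            return case, False
    new_case = CaseInvestigation(msg.patient_id, msg.condition, msg.result_date, [msg])
    open_cases.append(new_case)
    return new_case, True


cases = []
_, created_first = route_elr(ElrMessage("p-001", "STEC", date(2023, 6, 1)), cases)
_, created_repeat = route_elr(ElrMessage("p-001", "STEC", date(2023, 6, 10)), cases)
print(created_first, created_repeat, len(cases))
```

In this toy run the second positive result for the same patient and condition is attached to the existing investigation instead of opening a duplicate, which is the deduplication behavior the paragraph describes.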
PulseNet is the CDC-coordinated molecular surveillance network for foodborne pathogens operating in parallel with NNDSS case reporting. Clinical and environmental laboratories across the country submit whole genome sequences (WGS) of bacterial isolates from foodborne illness cases — primarily Salmonella, Listeria, STEC, and Campylobacter — to a national database. PulseNet algorithms identify genetic clusters that indicate cases are linked by a common contaminated food source even when patients are geographically dispersed and would not otherwise be recognized as part of an outbreak. PulseNet cluster detection has been the initiating event for most major multistate foodborne outbreak investigations in recent years. The FDA's GenomeTrakr network extends similar whole-genome sequencing surveillance to food production environments and the regulatory food safety context.
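As a rough illustration of how genetic clustering can link geographically dispersed cases, the sketch below groups isolates by single-linkage clustering on allele differences. It is not PulseNet's actual algorithm: the tiny allele profiles, the isolate labels, and the `max_diff` and `min_size` thresholds are invented for the example.

```python
from itertools import combinations


def allele_distance(a, b):
    """Number of loci at which two allele profiles differ."""
    return sum(x != y for x, y in zip(a, b))


def find_clusters(profiles, max_diff=10, min_size=3):
    """Single-linkage clustering: link any two isolates within max_diff alleles."""
    parent = {name: name for name in profiles}

    def find(x):
        # Union-find root lookup with path compression.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(profiles, 2):
        if allele_distance(profiles[a], profiles[b]) <= max_diff:
            parent[find(a)] = find(b)

    groups = {}
    for name in profiles:
        groups.setdefault(find(name), set()).add(name)
    return [sorted(g) for g in groups.values() if len(g) >= min_size]


# Hypothetical five-locus profiles from four states.
isolates = {
    "ME-01": (1, 4, 2, 7, 3),
    "TX-07": (1, 4, 2, 7, 3),
    "WA-12": (1, 4, 2, 8, 3),
    "FL-03": (5, 9, 6, 1, 2),
}
print(find_clusters(isolates, max_diff=1, min_size=3))
# → [['ME-01', 'TX-07', 'WA-12']]
```

The three closely related isolates form one cluster despite coming from different states, while the unrelated Florida isolate is left out.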
The BioSense Platform and ESSENCE (Electronic Surveillance System for the Early Notification of Community-based Epidemics) provide a complementary layer of syndromic surveillance that operates upstream of diagnosis. Rather than waiting for laboratory confirmation, BioSense ingests chief complaint and discharge diagnosis data from emergency departments in near-real-time. Spikes in chief complaints coded as “fever and rash,” “gastrointestinal illness,” or “respiratory infection” can signal emerging outbreaks days to weeks before confirmatory laboratory data reaches NNDSS. Syndromic surveillance proved valuable during COVID-19, mpox, and multiple influenza seasons for detecting geographic spread ahead of diagnostic confirmation capacity.
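A minimal sketch of the kind of spike detection this implies: compare each day's chief-complaint count with a trailing baseline and flag large deviations. This is a simplified moving-baseline z-score, not the detection algorithm ESSENCE actually runs; the 7-day baseline, the threshold of 3 standard deviations, and the visit counts are assumptions chosen for illustration.

```python
from statistics import mean, stdev


def flag_spikes(daily_counts, baseline_days=7, threshold=3.0):
    """Return indices of days whose count exceeds baseline mean + threshold * SD."""
    flagged = []
    for i in range(baseline_days, len(daily_counts)):
        baseline = daily_counts[i - baseline_days:i]
        mu = mean(baseline)
        sigma = stdev(baseline) or 1.0  # fall back to 1.0 when the baseline is flat
        if (daily_counts[i] - mu) / sigma > threshold:
            flagged.append(i)
    return flagged


# Hypothetical daily "fever and rash" ED visit counts; day 9 is an obvious spike.
visits = [12, 10, 11, 13, 12, 11, 12, 13, 11, 30, 12]
print(flag_spikes(visits))
# → [9]
```

One design consequence worth noting: because the spike day enters the baseline for the following days, a sustained rise becomes harder to flag over time, which is why production systems typically add a guard interval between the baseline and the day being tested.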
COVID-19 and the evolution of federal disease surveillance
COVID-19 exposed both the strengths and the limitations of the US disease surveillance architecture at scale. The initial response demonstrated that the federal system could rapidly designate a novel condition as nationally notifiable and stand up reporting infrastructure within weeks. But it also revealed fragmentation: case reports arrived at CDC from 50+ distinct state and territorial systems with varying data quality, field definitions, and reporting frequencies, creating significant analytical challenges during the critical early months of the pandemic.
COVID-19 case surveillance data published at data.cdc.gov represented the largest federal public health dataset ever released — over 100 million deidentified case-level records covering demographics, outcomes, hospitalization, and death across 2020–2023. The dataset required substantial deidentification and suppression to protect patient privacy while preserving analytical utility; small cell counts were suppressed, exact dates were shifted, and certain geographic fields were removed or aggregated.
The National Immunization Survey (NIS) expanded substantially during the pandemic to track COVID-19 vaccination coverage alongside existing childhood and influenza vaccination surveillance. Post-PHE, the transition from individual case reporting to aggregate sentinel surveillance required redesigning the measurement infrastructure: the current framework relies on sentinel laboratory networks for test positivity, the National Wastewater Surveillance System for environmental signal, emergency department data for severity indicators, and genomic sequencing networks for variant tracking. This tiered approach accepts lower sensitivity for individual case detection in exchange for operational sustainability at endemic disease levels.
Data limitations and the undercount problem
NNDSS counts reported cases only. For most conditions, reported cases represent a small fraction of actual disease burden. The magnitude of underreporting varies by condition, testing rate, and jurisdiction in ways that complicate cross-condition comparisons and trend analysis.
The CDC's foodborne illness attribution modeling illustrates the scale of undercount. For Salmonella, CDC estimates approximately 1 in 29 actual infections is confirmed and reported; for Campylobacter, the multiplier is approximately 1 in 24. For Listeria — a severe, hospitalization-requiring illness — the undercount multiplier is much lower (roughly 2 to 3), because the severity of illness drives care-seeking and testing. HIV surveillance is among the most complete in the system, with well-funded infrastructure and high clinical awareness; the primary undercount for HIV is undiagnosed cases rather than diagnosed-but-not-reported cases.
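The multipliers above translate directly into burden estimates. The snippet below applies them to a notional 1,000 reported cases per pathogen; the multipliers are the ones quoted in this section, the Listeria value of 2.5 is simply the midpoint of "roughly 2 to 3", and the 1,000-case input is an arbitrary illustration rather than a real count.

```python
# Undercount multipliers as quoted in the text (Listeria uses an assumed midpoint).
UNDERCOUNT_MULTIPLIERS = {
    "Salmonella": 29,
    "Campylobacter": 24,
    "Listeria": 2.5,
}


def estimate_true_burden(reported_cases: float, multiplier: float) -> float:
    """Scale reported cases by the estimated infections per reported case."""
    return reported_cases * multiplier


for pathogen, multiplier in UNDERCOUNT_MULTIPLIERS.items():
    estimate = estimate_true_burden(1_000, multiplier)
    print(f"{pathogen}: 1,000 reported -> ~{estimate:,.0f} estimated infections")
```

The same 1,000 reported cases imply very different true burdens depending on the pathogen, which is why raw NNDSS counts cannot be compared across conditions without adjusting for condition-specific underreporting.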
Surveillance completeness also varies dramatically by condition type. Viral gastroenteritis — the most common infectious illness in the US by volume — is not nationally notifiable and essentially absent from NNDSS. Influenza has its own separate sentinel surveillance system. The nationally notifiable conditions list covers a curated subset of pathogens for which the public health value of national reporting justifies the system cost — it is emphatically not a comprehensive picture of infectious disease burden. The 2020 COVID-19 pandemic disrupted routine surveillance for many other conditions, as clinical capacity was redirected and health-seeking behavior changed dramatically; case counts for non-COVID conditions dropped sharply in 2020 and have recovered at varying rates, complicating trend analysis for the 2019–2023 period.
Data access
NNDSS data is accessible through several public interfaces at varying levels of geographic and case-level detail.
The primary public-facing query interface is CDC WONDER NNDSS at wonder.cdc.gov/nndss/, which provides aggregate case counts by condition and jurisdiction. WONDER does not expose individual case-level data; queries return counts by selected stratifiers, with small-cell suppression applied for counts under five. The WONDER system supports both browser-based queries and a programmatic data request API at wonder.cdc.gov/controller/datarequest/D77 that accepts GET and POST requests with XML-formatted query specifications.
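Small-cell suppression of this kind is easy to picture in code. The sketch below masks counts under five before publication; treating zero as publishable is an assumption made for this example, and the function name and jurisdiction counts are hypothetical rather than drawn from WONDER itself.

```python
SUPPRESSION_THRESHOLD = 5  # counts under five are masked, per the text


def suppress_small_cells(counts, threshold=SUPPRESSION_THRESHOLD):
    """Replace nonzero counts below the threshold with None (suppressed).

    Assumption: a true zero is left visible; only counts from 1 up to
    threshold - 1 are masked.
    """
    return {
        key: (None if 0 < value < threshold else value)
        for key, value in counts.items()
    }


# Hypothetical counts for one condition across four jurisdictions.
raw = {"Maine": 0, "Vermont": 3, "Ohio": 5, "Texas": 12}
print(suppress_small_cells(raw))
# → {'Maine': 0, 'Vermont': None, 'Ohio': 5, 'Texas': 12}
```

Analysts working with suppressed output need to decide how to treat the masked cells; dropping them or imputing zero both bias totals downward.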
The data.cdc.gov Socrata platform publishes NNDSS weekly data by disease and jurisdiction at dataset ID 9bhg-hcku. The endpoint accepts Socrata SoQL query parameters for filtering, aggregation, and pagination. Fields include mmwr_year, mmwr_week, condition, location1 (state/jurisdiction), and current_week (case count for that week). The Morbidity and Mortality Weekly Report (MMWR) publishes weekly surveillance summaries of notifiable disease counts as both narrative reports and accompanying data tables. ArboNET, the CDC vector-borne disease surveillance system, publishes arboviral disease case data with geographic and species breakdowns separately. STI surveillance is published annually in the CDC's Sexually Transmitted Disease Surveillance report, with state-level data tables available alongside the report.
Python analysis: case trends and emerging signals
The CDC NNDSS weekly dataset on data.cdc.gov is the most accessible programmatic entry point for NNDSS analysis. The Socrata endpoint supports SoQL query parameters including $select for computed aggregations, $where for filtering, $group for group-by aggregation, and $order for sorting. Because the dataset contains millions of weekly rows across all conditions and jurisdictions, server-side aggregation via $select and $group is essential for performance: pulling the raw row-level data and aggregating locally would require fetching gigabytes of records.
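Before the full script, it may help to see how a single aggregated SoQL request is assembled. The snippet below only builds and prints the URL and makes no network call; the endpoint and field names are the ones given in this article, while the `build_soql_url` helper and the 'Salmonellosis' filter value are hypothetical.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Endpoint and field names as described in the text; not verified here.
NNDSS_ENDPOINT = "https://data.cdc.gov/resource/9bhg-hcku.json"


def build_soql_url(select, where, group, order, limit=1000):
    """Assemble a Socrata query URL that aggregates on the server side."""
    params = {
        "$select": select,
        "$where": where,
        "$group": group,
        "$order": order,
        "$limit": limit,
    }
    return f"{NNDSS_ENDPOINT}?{urlencode(params)}"


url = build_soql_url(
    select="mmwr_year, SUM(current_week) AS annual_cases",
    where="condition = 'Salmonellosis' AND mmwr_year >= 2019",
    group="mmwr_year",
    order="mmwr_year ASC",
)
print(url)

# Round-trip check: the encoded query decodes back to the original parameters.
print(parse_qs(urlsplit(url).query)["$group"])
```

With the grouping done on the server, the response holds one row per year rather than one row per week and jurisdiction.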
The following script fetches five years of case data for the top 10 notifiable conditions by volume, computes year-by-year trends, identifies states with highest per-capita rates for Lyme disease and syphilis, and flags any conditions showing greater than 20 percent year-over-year increase as potential emerging signals.
import requests
import pandas as pd
from collections import defaultdict
# -------------------------------------------------------
# CDC NNDSS Weekly Disease Surveillance Analysis
#
# This script:
#   1. Fetches CDC NNDSS weekly case data from data.cdc.gov
#      Socrata API (dataset: 9bhg-hcku)
#   2. Computes 5-year case trend for top 10 conditions
#   3. Identifies states with highest per-capita rates for
#      Lyme disease and syphilis
#   4. Flags conditions showing >20% YoY increase as
#      potential emerging signals
# -------------------------------------------------------
NNDSS_ENDPOINT = "https://data.cdc.gov/resource/9bhg-hcku.json"

# Census 2022 state population estimates
CENSUS_POP_2022 = {
    "ALABAMA": 5074296, "ALASKA": 733583, "ARIZONA": 7359197,
    "ARKANSAS": 3045637, "CALIFORNIA": 39029342, "COLORADO": 5839926,
    "CONNECTICUT": 3626205, "DELAWARE": 1018396, "FLORIDA": 22244823,
    "GEORGIA": 10912876, "HAWAII": 1440196, "IDAHO": 1939033,
    "ILLINOIS": 12582032, "INDIANA": 6833037, "IOWA": 3200517,
    "KANSAS": 2937150, "KENTUCKY": 4512310, "LOUISIANA": 4590241,
    "MAINE": 1385340, "MARYLAND": 6164661, "MASSACHUSETTS": 6981974,
    "MICHIGAN": 10034113, "MINNESOTA": 5717184, "MISSISSIPPI": 2940057,
    "MISSOURI": 6177957, "MONTANA": 1122867, "NEBRASKA": 1967923,
    "NEVADA": 3177772, "NEW HAMPSHIRE": 1395231, "NEW JERSEY": 9261699,
    "NEW MEXICO": 2113344, "NEW YORK": 19677151, "NORTH CAROLINA": 10698973,
    "NORTH DAKOTA": 779261, "OHIO": 11756058, "OKLAHOMA": 4019800,
    "OREGON": 4240137, "PENNSYLVANIA": 12972008, "RHODE ISLAND": 1093734,
    "SOUTH CAROLINA": 5282634, "SOUTH DAKOTA": 909824, "TENNESSEE": 7051339,
    "TEXAS": 30029572, "UTAH": 3380800, "VERMONT": 647464,
    "VIRGINIA": 8683619, "WASHINGTON": 7785786, "WEST VIRGINIA": 1775156,
    "WISCONSIN": 5892539, "WYOMING": 581381,
    "DISTRICT OF COLUMBIA": 671803,
}
def fetch_nndss_paginated(params: dict, page_size: int = 5000) -> list[dict]:
    """Fetch all records from CDC NNDSS Socrata endpoint using offset pagination."""
    records = []
    offset = 0
    while True:
        # Socrata paging parameters are literally named $limit and $offset.
        paged = {**params, "$limit": page_size, "$offset": offset}
        resp = requests.get(NNDSS_ENDPOINT, params=paged, timeout=60)
        resp.raise_for_status()
        batch = resp.json()
        if not batch:
            break
        records.extend(batch)
        print(f"  Fetched {len(records):,} records so far...")
        # A short page means the server has no more rows to return.
        if len(batch) < page_size:
            break
        offset += page_size
    return records
# -------------------------------------------------------
# Step 1: Fetch 5 years of aggregate case counts by disease
# and year (2019-2023, national level)
# -------------------------------------------------------
print("Fetching NNDSS weekly case data (2019-2023, all conditions)...")

# The NNDSS Socrata dataset includes mmwr_year, mmwr_week, condition,
# location1 (state), and current_week (case count for that week).
trend_params = {
    "$select": "mmwr_year, condition, SUM(current_week) AS annual_cases",
    "$where": "mmwr_year >= 2019 AND mmwr_year <= 2023 AND current_week IS NOT NULL",
    "$group": "mmwr_year, condition",
    "$order": "mmwr_year ASC, condition ASC",
}
trend_records = fetch_nndss_paginated(trend_params)
trend_df = pd.DataFrame(trend_records)

# Socrata returns values as strings; coerce them to numbers before aggregating.
trend_df["mmwr_year"] = pd.to_numeric(trend_df["mmwr_year"], errors="coerce")
trend_df["annual_cases"] = pd.to_numeric(trend_df["annual_cases"], errors="coerce").fillna(0)
# -------------------------------------------------------
# Step 2: Top 10 conditions by total 5-year case volume
# -------------------------------------------------------
condition_totals = (
    trend_df.groupby("condition")["annual_cases"]
    .sum()
    .sort_values(ascending=False)
)
top10 = condition_totals.head(10).index.tolist()

print("\nTop 10 Nationally Notifiable Conditions by 5-Year Case Volume (2019-2023)")
print("-" * 65)
print(f"{'Condition':<45} {'Total Cases':<12}")
print("-" * 65)
for cond in top10:
    total = int(condition_totals[cond])
    print(f"{cond:<45} {total:>12,}")

# Year-by-year breakdown for top 10
top10_df = trend_df[trend_df["condition"].isin(top10)].copy()
pivot_top10 = top10_df.pivot_table(
    index="condition", columns="mmwr_year", values="annual_cases", aggfunc="sum"
).fillna(0).astype(int)
pivot_top10 = pivot_top10.reindex(top10)

print("\nYear-by-Year Case Counts for Top 10 Conditions")
print("-" * 80)
years = sorted(top10_df["mmwr_year"].dropna().unique().astype(int).tolist())
print(f"{'Condition':<40}", end="")
for y in years:
    print(f" {y:>8}", end="")
print()
print("-" * 80)
for cond in top10:
    print(f"{cond:<40}", end="")
    for y in years:
        val = pivot_top10.loc[cond, y] if y in pivot_top10.columns else 0
        print(f" {int(val):>8,}", end="")
    print()
# -------------------------------------------------------
# Step 3: Per-capita Lyme disease and syphilis by state
# (most recent year available: 2023)
# -------------------------------------------------------
print("\nFetching state-level Lyme and syphilis counts for 2023...")
lyme_syphilis_params = {
    "$select": "location1, condition, SUM(current_week) AS total_cases",
    "$where": (
        "mmwr_year = 2023 "
        "AND current_week IS NOT NULL "
        "AND (condition LIKE '%Lyme%' OR condition LIKE '%Syphilis%')"
    ),
    "$group": "location1, condition",
    "$order": "condition ASC, location1 ASC",
}
ls_records = fetch_nndss_paginated(lyme_syphilis_params)
ls_df = pd.DataFrame(ls_records)
ls_df["total_cases"] = pd.to_numeric(ls_df["total_cases"], errors="coerce").fillna(0)

# Match jurisdictions to the population table; rows without a match
# (territories, regional totals) are dropped. Note that 2023 counts are
# divided by 2022 population estimates.
ls_df["location1"] = ls_df["location1"].str.upper().str.strip()
ls_df["pop"] = ls_df["location1"].map(CENSUS_POP_2022)
ls_df = ls_df.dropna(subset=["pop"])
ls_df["rate_per_100k"] = ((ls_df["total_cases"] / ls_df["pop"]) * 100_000).round(2)

for disease_filter in ["Lyme", "Syphilis"]:
    subset = ls_df[ls_df["condition"].str.contains(disease_filter, case=False, na=False)]
    if subset.empty:
        print(f"\nNo data found for conditions containing '{disease_filter}'")
        continue
    # If multiple condition variants, sum them per state
    state_totals = (
        subset.groupby("location1")
        .agg(total_cases=("total_cases", "sum"), pop=("pop", "first"))
        .reset_index()
    )
    state_totals["rate_per_100k"] = (
        (state_totals["total_cases"] / state_totals["pop"]) * 100_000
    ).round(2)
    ranked = state_totals.sort_values("rate_per_100k", ascending=False).head(15).reset_index(drop=True)
    ranked.index += 1
    print(f"\nTop 15 States: {disease_filter} rate per 100,000 (2023)")
    print("-" * 60)
    print(f"{'Rank':<6} {'State':<28} {'Cases':<8} {'Rate/100k':<10}")
    print("-" * 60)
    for rank, row in ranked.iterrows():
        print(
            f"{rank:<6} {row['location1']:<28} "
            f"{int(row['total_cases']):>8,} {row['rate_per_100k']:>10.2f}"
        )
# -------------------------------------------------------
# Step 4: Flag conditions with >20% YoY increase (2022->2023)
# as potential emerging signals
# -------------------------------------------------------
print("\nScanning for conditions with >20% increase from 2022 to 2023...")
yoy = trend_df[trend_df["mmwr_year"].isin([2022, 2023])].copy()
yoy_pivot = yoy.pivot_table(
    index="condition", columns="mmwr_year", values="annual_cases", aggfunc="sum"
).fillna(0)

emerging = []
for cond, row in yoy_pivot.iterrows():
    cases_2022 = row.get(2022, 0)
    cases_2023 = row.get(2023, 0)
    if cases_2022 >= 100:  # ignore very small-count conditions
        pct_change = ((cases_2023 - cases_2022) / cases_2022) * 100
        if pct_change > 20:
            emerging.append({
                "condition": cond,
                "cases_2022": int(cases_2022),
                "cases_2023": int(cases_2023),
                "pct_change": round(pct_change, 1),
            })
emerging.sort(key=lambda x: x["pct_change"], reverse=True)

if emerging:
    print("\nConditions with >20% year-over-year increase (2022->2023):")
    print("-" * 70)
    print(f"{'Condition':<40} {'2022':<8} {'2023':<8} {'Change%':<10}")
    print("-" * 70)
    for e in emerging:
        print(
            f"{e['condition']:<40} {e['cases_2022']:>8,} "
            f"{e['cases_2023']:>8,} {e['pct_change']:>9.1f}%"
        )
else:
    print("No conditions met the >20% increase threshold with >= 100 baseline cases.")
A practical note on the NNDSS Socrata dataset: condition name strings are not fully standardized across years. The same condition may appear as “Lyme disease,” “Lyme Disease,” or with the CSTE case definition year appended in certain vintages. When building production pipelines, fetching the distinct condition list first and normalizing to lowercase for matching is more robust than exact-string filtering in the API query. Null current_week values also need explicit handling: a null indicates a week in which no cases were reported (or data was not submitted) rather than a suppressed small count, so it should be treated as zero in trend calculations rather than as missing data.
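One way to do that normalization is sketched below. The exact form in which a case definition year is appended is not documented here, so the pattern assumes a trailing four-digit year, optionally in parentheses; both that assumption and the `normalize_condition` helper are illustrative.

```python
import re


def normalize_condition(name: str) -> str:
    """Lowercase, trim, collapse whitespace, and drop a trailing 4-digit year.

    Assumption: an appended case definition year looks like
    "Lyme disease 2022" or "Lyme disease (2022)".
    """
    cleaned = re.sub(r"\s+", " ", name.strip().lower())
    return re.sub(r"[\s,(]*\b(?:19|20)\d{2}\)?$", "", cleaned).strip()


variants = ["Lyme disease", "Lyme Disease", "  Lyme disease 2022 ", "Lyme disease (2022)"]
print({normalize_condition(v) for v in variants})
# → {'lyme disease'}
```

Normalizing after fetching the distinct condition list keeps the API query simple and leaves the mapping from raw strings to canonical names inspectable in one place.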
The 20 percent year-over-year emerging signal threshold is a starting point, not a definitive criterion. Many conditions show large percentage swings on small absolute bases — a condition with 50 cases in one year and 65 in the next shows 30 percent growth but is unlikely to represent an actionable emerging threat. Applying a minimum baseline case count (100 cases used in the script above) reduces false positives from small-count noise. For genuine emerging pathogen detection, year-over-year percentage change in NNDSS data should be triangulated against syndromic surveillance signals, laboratory network trends, and geographic clustering before escalating to outbreak investigation.