No single federal file tells you how Americans die. The leading-causes ranking gives you the headline order, but buries the overdose epidemic inside a broad accidents line. The injury file decomposes that line but does not rank it against heart disease. Suicide, overdose, and excess deaths each answer their own question and only their own. Yet all five datasets are cut from the same cloth—the death certificates of the National Vital Statistics System—so when you understand how they slice it, they fit together into the single picture none of them shows alone.
This article is a guide to assembling that picture from federal data. It covers, in order: the common foundation (every US death generates a certificate whose single underlying cause is coded to ICD-10, and CDC's National Center for Health Statistics aggregates those records into different cuts); the five datasets and exactly which question each answers (the leading-causes ranking, the injury and external-cause records, the suicide-specific series, drug-overdose mortality, and excess deaths); the shared join keys (year, state or county, age group, sex, and the ICD-10 cause grouping) and why age-adjustment to the year-2000 standard population is what makes any of them comparable; the hard part, which is aligning the five datasets' cause groupings, sliced as they are by rankable cause, by external-cause code, by intent, and against an expected baseline; the long-run story the assembly tells that no single file does, from the dominance of heart disease and cancer through the rise of the deaths of despair to the COVID-era spike; a Python workflow that pulls a cause across years from each relevant source and charts the trajectories together; and the caveats (single-cause reduction, coding changes, small counts, and provisional recent years) that travel with every one of the five.
The common foundation: one death certificate, many cuts
The reason these five datasets can be assembled at all is that they are not five independent collections—they are five views of one underlying record. The National Vital Statistics System (NVSS) is the decentralized arrangement through which the federal government compiles the country's deaths. Vital registration is a state function: a death certificate is completed locally—a physician, medical examiner, or coroner certifies the cause, a funeral director records the demographic facts—and registered with the state, which then provides the record to NCHS for standardization, coding, and aggregation. NVSS aims to capture every death that occurs in the United States, which is what makes its products counts rather than the survey estimates that characterize much of federal health data. Every one of the five datasets in this guide draws from that same census of death.
The pivot of the whole system is the assignment of a single underlying cause to each death, coded to the International Classification of Diseases, Tenth Revision (ICD-10), which has governed US underlying-cause coding since 1999. A certificate's medical section lists the immediate cause, the conditions leading to it, the underlying condition that began the sequence, and other contributing conditions; NCHS applies the World Health Organization's coding rules to select the one underlying cause—the disease or injury that initiated the train of events leading directly to death. From that single coded cause, NCHS produces the different cuts mechanically: it ranks the certificates by the NCHS list of rankable causes for the leading-causes view; it pulls the external-cause (injury) codes for the injury and overdose files; it filters on intent for the suicide series; and it compares the total death count to an expected baseline for excess deaths. Same source, same coding, different tabulation—which is precisely why they can be rejoined.
The five datasets and which question each answers
In our database the five views are stored as cdc_leading_causes, cdc_injury_mortality, cdc_suicide, cdc_overdose, and cdc_excess_deaths. Each is parsed and keyed by year and geography, so the assembly work is aligning their cause definitions and age-adjustment rather than parsing five separate NCHS releases from scratch. The discipline of the whole exercise is knowing which dataset answers which question, because they overlap and the wrong file gives a confidently wrong answer.
Leading causes answers “what kills the most people, in what order?” It is the most aggregated view: per state and year, the top rankable causes (heart disease, cancer, accidents, chronic lower respiratory disease, stroke, Alzheimer's, diabetes), each with a count and an age-adjusted rate. Its strength is the ranking across causes; its weakness is that it buckets all unintentional mechanisms into one “accidents” line.

Injury mortality answers “within external causes, what mechanism and what intent?” It classifies the external-cause deaths (the ones the ranking hides inside accidents) by mechanism (firearm, motor vehicle, poisoning, fall) and by intent (unintentional, suicide, homicide, undetermined).

Suicide answers “the intentional-self-harm slice, in full detail”: by method, by demographic, and over a long historical span, a single line in the ranking expanded into its own series.

Drug-overdose mortality answers “the poisoning slice, by drug class and intent”: it is the file that isolates the opioid and, later, the synthetic-opioid surge that drove the accidents line up the ranking.

And excess deaths answers a different question entirely: “how many more people died than a historical baseline predicts, regardless of cause?” It is the only one of the five not organized by cause at all.
Shared join keys and the columns that align them
Because the five files share the NVSS foundation, they share the columns that let you join them. The common keys are year, the geography (state or, in the finer files, county), age group, sex, and the ICD-10 cause grouping—and almost every file carries an age-adjusted rate alongside the raw count. The columns below are the recurring backbone an analyst aligns across the assembly; individual files add their own slice column (a rankable-cause label, an injury intent, a drug class), but these are what they have in common:
    year                 -- calendar year of the deaths (the primary time key)
    geography            -- state name, "United States", or county/FIPS in finer files
    age_group            -- standard NCHS age band (e.g. 25-34), the adjustment unit
    sex                  -- decedent sex, where the file is stratified by it
    cause_grouping       -- the file's cause slice: rankable cause, external-cause
                            code, intent, drug class, or "all causes" for excess
    deaths               -- number of deaths in the cell (the raw count)
    age_adjusted_rate    -- deaths per 100,000, standardized to the 2000 US population
    crude_rate           -- deaths per 100,000 of the actual population, unadjusted
    population           -- the denominator population for the cell
    expected / baseline  -- excess-deaths only: the predicted count for the period

The year and geography keys are what let the leading-causes ranking sit on the same axis as the overdose trajectory and the excess-deaths curve. The age_adjusted_rate is the column that makes any of them comparable across states and decades, for the reason developed below. And the cause_grouping is the one that demands the most care, because it is precisely the column that differs in kind across the five files (a rankable-cause name in one, an external-cause code in another, an intent flag in a third), so aligning it is the substance of the assembly rather than a mechanical join.
Why age-adjustment is what makes any of them comparable
The single methodological concept that runs through all five files, and the one most often misread, is age-adjustment. Death is overwhelmingly a function of age: an older population has more deaths, and more deaths from age-associated diseases, than a younger one—not because it is less healthy but because more of its members are old. A state with an old age structure will post higher crude death rates than a young state for reasons that have nothing to do with the quality of its health system. Worse for assembly across time, the US population itself has aged over the decades, so a rising crude rate can reflect demographic drift rather than rising risk. Comparing crude rates across places or years therefore largely measures the difference in age structures, not the difference in mortality risk.
Age-adjustment removes that confound, and where a file provides it, it does so the same way—which is exactly what lets those rates be placed on a common axis. The technique computes rates within narrow age bands and re-weights them to a common standard population—for US mortality statistics, the year-2000 US standard population—so that every state and every year is evaluated as if it had the same age distribution. The age-adjusted death rate answers the counterfactual: if this population had the standard age structure, what would its death rate from this cause be? When the leading-causes rate, the overdose rate, and the suicide rate are all standardized to the same year-2000 population, the chronic-disease trajectory and the deaths-of-despair trajectory can be compared directly—decline in one and rise in the other measured against the same demographic yardstick. The discipline is to compare like with like: an age-adjusted rate against another age-adjusted rate, never an age-adjusted rate against a crude or all-ages rate, and never two crude rates across the decades—the most common error in reading mortality data and the fastest way to corrupt the assembly. (As the workflow below notes, some public extracts publish only an all-ages rate; aligning those to the age-adjusted series requires pulling the age-adjusted cut from CDC WONDER rather than mixing the two.)
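The re-weighting described above can be sketched in a few lines. This is a minimal illustration of direct standardization, not NCHS's production method: the three age bands, the weights in STANDARD, and both states' counts are hypothetical numbers chosen for easy arithmetic, and the real year-2000 standard uses finer bands and different weights.

```python
def crude_rate(deaths, population):
    """Deaths per 100,000 of the actual population, all ages pooled."""
    return sum(deaths.values()) * 100_000 / sum(population.values())


def age_adjusted_rate(deaths, population, weights):
    """Direct standardization: weight each age-specific rate by the
    standard population's share for that age band."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("standard weights must sum to 1")
    return sum(
        weights[band] * deaths[band] * 100_000 / population[band]
        for band in weights
    )


# Hypothetical three-band standard (NOT the real year-2000 weights).
STANDARD = {"0-44": 0.60, "45-64": 0.25, "65+": 0.15}

# Two hypothetical states with identical age-specific risk
# (20, 200, and 1,500 deaths per 100,000) but different age structures.
young_pop = {"0-44": 700_000, "45-64": 200_000, "65+": 100_000}
young_deaths = {"0-44": 140, "45-64": 400, "65+": 1_500}
old_pop = {"0-44": 500_000, "45-64": 250_000, "65+": 250_000}
old_deaths = {"0-44": 100, "45-64": 500, "65+": 3_750}

for name, d, p in [("young", young_deaths, young_pop),
                   ("old", old_deaths, old_pop)]:
    print(name,
          round(crude_rate(d, p), 1),
          round(age_adjusted_rate(d, p, STANDARD), 1))
```

In this toy case the crude rates differ by more than a factor of two (204 against 435 per 100,000) purely because of age structure, while the age-adjusted rates coincide at 287, which is the behavior the comparison across states and decades depends on.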
The hard part: aligning five different cause groupings
If the join keys were the whole story, this would be a trivial merge. The actual work, and the reason a guide is needed, is that the five files slice cause along four incompatible axes, and reconciling them is the substance of the assembly. The leading-causes file uses the NCHS list of rankable causes, broad mutually exclusive categories chosen to be stable enough to rank year over year. The injury and overdose files use external-cause (V–Y) codes, classifying deaths by the mechanism and the circumstance of the injury. The suicide file filters on intent (intentional self-harm), a slice that cuts across mechanisms. And the excess-deaths file is not organized by cause at all; it compares all deaths to an expected baseline.
The misalignment is not cosmetic—the same death lands in different boxes in different files. A fatal drug overdose appears in the leading-causes file inside the broad “accidents” (unintentional injury) line; in the injury file as a poisoning death with an intent; in the overdose file broken out by drug class; and, if it was intentional, in the suicide file as well. A firearm suicide is a single death that is simultaneously a suicide (by intent), a firearm death (by mechanism), and—because suicide is itself a rankable cause—a line in the ranking. Assembling the files therefore means deciding, explicitly, how the rankable categories map onto the external-cause codes and the intents, so that you do not double-count a death that two files both contain or miss one that only the finer file reports. The leading-causes accidents line, properly decomposed, equals the unintentional slice of the injury file; the suicide rankable cause equals the intentional-self-harm slice of the injury file equals the suicide file; the overdose file is the poisoning mechanism of the injury file, further split by drug. Drawing those equivalences correctly is the craftsmanship the assembly demands, and getting them wrong is how an analyst ends up counting the overdose epidemic twice or attributing it to the wrong cause.
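One way to make those equivalences explicit is to classify each death by its underlying-cause code and record every box it lands in. The sketch below assumes the commonly published three-character ICD-10 ranges for unintentional injury, suicide, and drug poisoning; the function name files_containing and the output labels are hypothetical, the sequelae subcode Y87.0 and the U03 codes are deliberately left out, and the ranges should be checked against current NCHS documentation before any real use.

```python
import re

# Three-character ICD-10 underlying-cause ranges (simplified sketch).
UNINTENTIONAL = [(("V", 1), ("X", 59)), (("Y", 85), ("Y", 86))]
SUICIDE = [(("X", 60), ("X", 84))]
DRUG_POISONING = [
    (("X", 40), ("X", 44)),  # unintentional drug poisoning
    (("X", 60), ("X", 64)),  # intentional self-poisoning by drugs
    (("X", 85), ("X", 85)),  # assault by drugs
    (("Y", 10), ("Y", 14)),  # drug poisoning, undetermined intent
]


def _key(code):
    """Turn 'X42' or 'X42.0' into a sortable (letter, number) pair."""
    m = re.fullmatch(r"([A-Z])(\d{2})(?:\.\d+)?", code.strip().upper())
    if not m:
        raise ValueError(f"unrecognized ICD-10 code: {code!r}")
    return m.group(1), int(m.group(2))


def _in_any(key, ranges):
    return any(lo <= key <= hi for lo, hi in ranges)


def files_containing(code):
    """Report which views of the assembly would include this death."""
    key = _key(code)
    return {
        "accidents_line": _in_any(key, UNINTENTIONAL),
        "suicide_line": _in_any(key, SUICIDE),
        "overdose_file": _in_any(key, DRUG_POISONING),
    }


for code in ["X42", "X62", "X74", "I21"]:
    print(code, files_containing(code))
```

An unintentional opioid poisoning (X42) lands in both the accidents line and the overdose file; a drug suicide (X62) lands in the suicide line and the overdose file but not in accidents; a firearm suicide (X74) lands only in the suicide line. Any sum across files has to account for exactly these overlaps.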
The story the assembly tells that no single file does
Assembled correctly, the five files narrate the modern history of American mortality in a way that no one of them can. The opening movement is the slow dominance and slow retreat of chronic disease: heart disease and cancer have held the top two positions of the ranking for the entire modern era, together accounting for a large share of all deaths—but their age-adjusted rates have drifted downward for decades, the fruit of declining smoking, better treatment of cardiovascular disease, and earlier cancer detection. Read off the leading-causes file alone, that is a story of steady, if uneven, progress at the top of the table.
The counter-melody is the one the cause-specific files supply: the rise of the so-called deaths of despair. Through the 2000s and 2010s, suicide and drug-overdose mortality climbed—the suicide series shows a steady, broad-based increase across most demographic groups, and the overdose file shows a steeper climb that passed through distinct waves, from prescription opioids to heroin and then, most sharply, into the synthetic-opioid era as illicitly manufactured fentanyl drove overdose deaths to levels that reshaped the entire mortality landscape for younger adults. In the leading-causes file this surge is visible only as the broad accidents category climbing the ranking; it is the injury and overdose files that reveal it for what it is. Laid against the gently declining chronic-disease lines, the rising despair lines are the single most important divergence in recent US mortality—and it is only visible when the files are assembled.
The final movement belongs to the fifth file. The COVID-19 pandemic produced a mortality spike that the cause-coded files capture imperfectly—COVID-19 entered the leading-causes ranking near the top in its peak years, but a portion of the pandemic's death toll was never coded to the disease itself, lost to misattribution, untested infections, and the indirect deaths from disrupted care. The excess-deaths file isolates exactly that total burden: by comparing observed deaths to the historical baseline, it measures how many more Americans died than expected, regardless of cause, and so quantifies the pandemic's full impact in a way the cause counts cannot. The leading-causes ranking tells you the order; the injury, suicide, and overdose files tell you what is moving inside the categories; and excess deaths tells you the total. Together they are the complete federal account of how the country dies.
Analytical uses of the assembled picture
A reconciled, age-adjusted, year-and-geography-keyed assembly of the five files supports analyses that no single one can, because its strength is the comparison across causes, files, places, and years.
Charting competing trajectories on one axis is the headline use: placing the declining chronic-disease rates and the rising deaths-of-despair rates on the same age-adjusted scale to see the divergence directly, the analysis the workflow below performs. Decomposing a moving rank exploits the layering—when accidents climbs the ranking, the injury and overdose files answer whether it is crashes, falls, or poisonings, and the overdose file answers which drug class, turning a vague top-level movement into a specific, attributable cause. Reconciling the total against the parts uses excess deaths as a check on the cause-coded files: in a year when excess deaths exceed the sum of the coded cause increases, the gap itself is evidence of under-attribution—the pandemic's uncounted deaths, an undercounted overdose wave—a question only the assembly can pose. Finally, geographic and demographic equity analysis brings the shared state, age-group, and sex keys to bear: aligning the overdose surge, the suicide trend, and the chronic-disease geography across the same states and demographics surfaces where multiple mortality crises compound in the same populations—the question that motivates the whole exercise of assembling the files.
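The total-against-parts check reduces to simple arithmetic once the layers are kept separate. The sketch below uses invented round numbers for one hypothetical jurisdiction and year; the function names and the cause labels are illustrative and do not come from any of the five files.

```python
def excess_deaths(observed, expected):
    """All-cause deaths above the baseline prediction."""
    return observed - expected


def unattributed_gap(observed, expected, coded_increases):
    """Excess deaths not explained by year-over-year increases in the
    cause-coded series. A positive gap hints at under-attribution."""
    return excess_deaths(observed, expected) - sum(coded_increases.values())


# Hypothetical figures, not real data.
observed, expected = 1_200, 1_000
coded_increases = {"covid-19": 140, "accidents": 25, "heart disease": 10}

excess = excess_deaths(observed, expected)
gap = unattributed_gap(observed, expected, coded_increases)
pct_excess = 100 * excess / expected

print(f"excess={excess} ({pct_excess:.0f}% above baseline), unattributed={gap}")
```

The gap is only as good as the baseline behind the expected count, so it is best read as a prompt for further decomposition rather than as a measured quantity.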
Python workflow: aligning the files on one trajectory
The script below pulls a cause across years from the leading-causes resource and the injury resource on CDC's public Socrata API at data.cdc.gov, aligns them on year, and assembles a single panel that places the stable chronic-disease summit (heart disease and cancer) alongside the deaths-of-despair series (suicide and drug overdose). It is the minimal version of the assembly this article describes: same time key, different files reconciled into one trajectory. One comparability caveat is built into the code and worth stating plainly here: the leading-causes resource ships an age-adjusted rate, while the public injury resource exposes an all-ages rate, so the despair line is directionally right but not standardized to the year-2000 population. CDC WONDER's underlying-cause query system is the route to a rigorously age-adjusted injury series and to the multiple-cause and county-level cuts the published resources do not pre-compute. No API key is required for modest volumes. Because NCHS field names and dataset identifiers vary between releases, the script resolves column names defensively and isolates the dataset ids in one place; any production use should be validated against the current data.cdc.gov catalog.
    import requests
    import pandas as pd

    # Every CDC mortality dataset below is built from the same National Vital
    # Statistics System death certificates, so they share join keys: year,
    # geography, and an ICD-10 cause grouping. This script pulls one long-run
    # trajectory from each of two genuine public resources on data.cdc.gov
    # (Socrata, no key for modest volumes) and aligns them on year to chart
    # the chronic-disease summit against the deaths-of-despair series.
    #
    # Important on comparability: the Leading Causes resource ships the
    # age-adjusted death rate (aadr, standardized to the year-2000 US
    # standard population). The public Injury Mortality resource exposes an
    # all-ages rate (age_specific_rate at age_years == "All Ages"), which is
    # NOT year-2000-standardized. For a rigorously age-adjusted injury series
    # use CDC WONDER's underlying-cause query system; the all-ages rate here
    # is the directionally correct, key-free trajectory the resource offers.
    SODA = "https://data.cdc.gov/resource"

    # 4x4 Socrata dataset ids change across NCHS releases; isolate them here
    # and confirm against the current data.cdc.gov catalog before relying on
    # any one. bi63-dtpu is the NCHS "Leading Causes of Death: United States"
    # (carries state and aadr); nt65-c7a7 is "Injury Mortality: United
    # States" (national only -- no state column -- sliced by injury_intent
    # and injury_mechanism, with age_specific_rate by age_years).
    LCOD = "bi63-dtpu"    # leading causes (broad ranking, age-adjusted)
    INJURY = "nt65-c7a7"  # injury mortality (national, intent x mechanism)


    def fetch(dataset, where=None, select=None, limit=50000):
        """Download one Socrata resource as a DataFrame of strings."""
        params = {"$limit": limit}
        if where:
            params["$where"] = where
        if select:
            params["$select"] = select
        r = requests.get(f"{SODA}/{dataset}.json", params=params, timeout=120)
        r.raise_for_status()
        return pd.DataFrame(r.json())


    def _col(df, *names):
        # NCHS field names vary by release; return the first one present.
        for n in names:
            if n in df.columns:
                return n
        return None


    def _totals_only(df, names, total_label):
        # If the resource is stratified by this column, keep only its total
        # row so strata are not averaged together. The column names and the
        # total label are assumptions; skip the filter if either is absent.
        c = _col(df, *names)
        if c is None:
            return df
        mask = df[c].astype(str).str.lower().eq(total_label.lower())
        return df[mask] if mask.any() else df


    # --- A national age-adjusted-rate trajectory from the ranking ----------
    # The leading-causes resource carries a state column and an age-adjusted
    # rate, so filter to the "United States" aggregate and one rankable cause.
    def lcod_trajectory(cause):
        df = fetch(LCOD)
        if df.empty:
            return pd.Series(dtype=float)
        yr = _col(df, "year")
        st = _col(df, "state", "geography", "jurisdiction")
        ca = _col(df, "cause_name", "leading_cause", "cause")
        ra = _col(df, "aadr", "age_adjusted_death_rate", "age_adjusted_rate")
        if None in (yr, st, ca, ra):
            return pd.Series(dtype=float)  # expected columns not found
        df["_y"] = pd.to_numeric(df[yr], errors="coerce")
        df["_r"] = pd.to_numeric(df[ra], errors="coerce")
        sub = df[df[st].astype(str).str.lower().eq("united states")
                 & df[ca].astype(str).str.contains(cause, case=False,
                                                   na=False, regex=False)]
        return sub.groupby("_y")["_r"].mean().sort_index()


    # --- A national all-ages-rate trajectory from the injury file ----------
    # This resource is national (no state column) and slices cause along two
    # columns: injury_intent (Unintentional/Suicide/Homicide/...) and
    # injury_mechanism (Firearm/Poisoning/Fall/...). Suicide is an intent;
    # drug overdose is the "Poisoning" mechanism. Read the all-ages band.
    def injury_trajectory(intent="All Intentions", mechanism="All Mechanisms"):
        df = fetch(INJURY)
        if df.empty:
            return pd.Series(dtype=float)
        yr = _col(df, "year")
        it = _col(df, "injury_intent", "intent")
        me = _col(df, "injury_mechanism", "mechanism")
        ag = _col(df, "age_years", "age_group", "age")
        ra = _col(df, "age_specific_rate", "rate", "age_adjusted_rate")
        if None in (yr, it, me, ag, ra):
            return pd.Series(dtype=float)  # expected columns not found
        df["_y"] = pd.to_numeric(df[yr], errors="coerce")
        df["_r"] = pd.to_numeric(df[ra], errors="coerce")
        # Assumption: if the release is also stratified by sex or race, the
        # total rows carry these labels; without this the mean below would
        # average across strata instead of reading the national total.
        df = _totals_only(df, ("sex",), "Both sexes")
        df = _totals_only(df, ("race",), "All races-All origins")
        sub = df[df[it].astype(str).str.lower().eq(intent.lower())
                 & df[me].astype(str).str.lower().eq(mechanism.lower())
                 & df[ag].astype(str).str.lower().eq("all ages")]
        return sub.groupby("_y")["_r"].mean().sort_index()


    # Heart disease and cancer (the stable summit, age-adjusted) vs. the
    # deaths-of-despair series: suicide (intent) plus drug overdose, which is
    # the Poisoning mechanism, both aligned on year as all-ages rates.
    # Note: suicides by poisoning fall in both slices, so this simple sum
    # counts them twice; it is the simplified shape, not the reconciliation.
    heart = lcod_trajectory("Heart disease")
    cancer = lcod_trajectory("Cancer")
    despair = injury_trajectory(intent="Suicide").add(
        injury_trajectory(mechanism="Poisoning"), fill_value=0)

    panel = pd.concat({"heart": heart, "cancer": cancer,
                       "despair": despair}, axis=1).dropna(how="all")

    print("Death rate per 100,000, United States, by year")
    print("(heart/cancer age-adjusted; despair all-ages):")
    for y, row in panel.iterrows():
        h, c, d = row.get("heart"), row.get("cancer"), row.get("despair")
        h = float("nan") if h is None else h
        c = float("nan") if c is None else c
        d = float("nan") if d is None else d
        print(f"  {int(y)}  heart={h:6.1f}  cancer={c:6.1f}  despair={d:6.1f}")

    # The trajectory the assembly reveals that no single file does: heart
    # disease and cancer drifting down for decades while the deaths-of-
    # despair line climbs through the 2000s and 2010s into the synthetic-
    # opioid surge -- with the COVID years isolated by the excess-deaths file.
Two practical notes apply. First, the cause-grouping alignment in the script is deliberately simplified: it builds the deaths-of-despair series by summing the suicide intent (across all mechanisms) and the poisoning mechanism (across all intents) of the injury file, which is the right shape but not the rigorous reconciliation. Suicides by poisoning sit in both slices, so the simple sum counts them twice. A production assembly must map the rankable categories onto the external-cause codes and intents explicitly, so that the intentional poisoning deaths already counted in the suicide line are not counted again in the overdose line, and so that a death the injury file resolves more finely is attributed once and only once. Second, excess deaths cannot be merged on cause at all, because it is not a cause-coded series; it is joined to the others purely on year and geography and used as the total-burden check described above, never as another cause line in the same panel. Keeping the cause-coded files and the baseline-comparison file in separate logical layers, one for the composition and one for the total, is what keeps the assembly coherent.
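The overlap between the suicide intent and the poisoning mechanism can be made concrete with a toy intent-by-mechanism table. The counts below are invented, and the helper names are hypothetical; the point is only the arithmetic of summing two marginal slices that share a cell.

```python
# Hypothetical death counts keyed by (intent, mechanism). Not real data.
cells = {
    ("Suicide", "Poisoning"): 60,
    ("Suicide", "Firearm"): 250,
    ("Suicide", "Other"): 190,
    ("Unintentional", "Poisoning"): 900,
    ("Undetermined", "Poisoning"): 40,
    ("Unintentional", "Fall"): 400,
}


def total(cells, intent=None, mechanism=None):
    """Sum the cells matching an intent and/or a mechanism (None = all)."""
    return sum(
        n for (i, m), n in cells.items()
        if (intent is None or i == intent)
        and (mechanism is None or m == mechanism)
    )


suicide_all = total(cells, intent="Suicide")          # every mechanism
poisoning_all = total(cells, mechanism="Poisoning")   # every intent
overlap = total(cells, intent="Suicide", mechanism="Poisoning")

naive = suicide_all + poisoning_all   # counts poisoning suicides twice
deduplicated = naive - overlap        # each death counted once

print(naive, deduplicated)
```

With these toy counts the naive sum overstates the combined series by the 60 poisoning suicides. The same subtraction applies to rates only when both slices share one denominator, which is the case for national all-ages rates in a single year.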
Limitations and analytical caveats
Because the five files share one foundation, they share its limitations—and an analyst assembling them inherits every caveat at once, before adding the new ones the assembly itself introduces.
Every death is reduced to one underlying cause. The single-cause rule that makes the ranking coherent also means the cause-coded files cannot see comorbidity: a death certified to heart disease in a person who also had diabetes counts only against heart disease, and the contributing conditions vanish. Diseases that more often appear as contributing than underlying causes—diabetes is the classic case—are systematically understated in every cut except a multiple-cause analysis of the underlying NVSS files. The assembly inherits this for all five datasets at once.
Coding changes and certification practices shape every file. The 1999 ICD-10 transition broke direct comparability for some causes, so any trajectory that crosses it requires NCHS's comparability ratios before pre- and post-1999 rates share an axis. Within the ICD-10 era, certification practices evolve—improved recognition of Alzheimer's, shifting conventions for coding drug deaths, the introduction of new codes for emerging causes—so a movement in any of the five series can reflect a change in coding rather than in how often people die. The risk compounds in an assembly, because a coding change can shift a death from one file's box to another's and distort the alignment.
Small counts make fine cells unstable. The deeper the cut—a single drug class, in a single small state, in a single age band—the smaller the death count, and rates built on small numbers carry wide confidence intervals and swing on statistical noise. NCHS suppresses or flags rates resting on too few deaths for exactly this reason. The assembly is most prone to this where the finer files (overdose by drug, suicide by method) are pushed to the state-and-demographic level, precisely the cells an equity analysis most wants—so those cells must be read with their margins of error, not as point facts.
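How quickly a rate loses precision as the count shrinks can be shown with a rough interval calculation. The sketch below treats the death count as Poisson and uses a normal approximation, which is itself crude for very small counts; the thresholds of 10 and 20 deaths are assumptions modeled on the suppression and reliability conventions commonly described for CDC WONDER, and the function name is hypothetical.

```python
import math


def rate_with_interval(deaths, population, z=1.96,
                       suppress_below=10, unreliable_below=20):
    """Crude rate per 100,000 with an approximate 95% interval.

    Assumes deaths ~ Poisson, so the relative standard error of the
    rate is 1 / sqrt(deaths). Thresholds are assumed conventions.
    """
    if deaths < suppress_below:
        return {"suppressed": True}
    rate = deaths * 100_000 / population
    se = rate / math.sqrt(deaths)
    return {
        "suppressed": False,
        "rate": rate,
        "low": rate - z * se,
        "high": rate + z * se,
        "unreliable": deaths < unreliable_below,
    }


# Same underlying rate, very different precision (hypothetical cells).
small_cell = rate_with_interval(deaths=16, population=200_000)
large_cell = rate_with_interval(deaths=400, population=5_000_000)
print(small_cell)
print(large_cell)
```

Both cells have the same point rate, but the interval around the 16-death cell is five times wider than the one around the 400-death cell, which is why fine state-and-demographic cuts should be read with their margins rather than as point facts.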
Recent years are provisional, and the data is not real-time. Mortality data is finalized only after a substantial lag—deaths must be registered, coded, and processed—so the most recent year or two in any release is provisional and will rise as late certificates arrive. Drug-overdose data in particular depends on toxicology and is among the slowest to finalize. In an assembly the lags are not uniform across the five files, so the leading edge of the combined trajectory can be distorted by files completing at different speeds; established multi-year trends are what the assembly is authoritative for, not the latest months.
Held with these caveats in mind, the five tables—cdc_leading_causes, cdc_injury_mortality, cdc_suicide, cdc_overdose, and cdc_excess_deaths—compose a resource no single one of them is: a year-and-geography-keyed, age-adjusted, reconciled account of how Americans die, the ranking telling you the order, the injury and suicide and overdose files telling you the composition inside the categories that move, and the excess-deaths measure telling you the total burden—the full picture of mortality in America that lives not in any one federal file but in their assembly.