Every death in the United States is reduced, on its certificate, to a single underlying cause—the one disease or injury that set in motion the chain of events ending in death. The CDC's National Center for Health Statistics gathers those single causes, groups them into a fixed list of rankable causes, and publishes the answer to the bluntest public-health question there is: in each state, in each year, what kills the most people, and at what rate. That ranking—roughly 10,900 state-cause-year records of heart disease, cancer, accidents, and the rest of the top ten—is the top-level map of how Americans die.
This article covers what the leading-causes dataset is and how the National Vital Statistics System and ICD-10 produce it. It explains how a death certificate's many conditions are collapsed into one underlying cause and then sorted into the NCHS list of rankable causes, and why age-adjustment is the technical move that makes a young state and an old one comparable. It then walks through the top-ten causes and how the ranking has shifted over time as drug overdoses and Alzheimer's disease climbed, how the rankings differ across states and where age-adjusted rates run highest, and how this dataset sits above the cause-specific mortality files on injury, suicide, and excess deaths as the headline view. It closes with a Python workflow that pulls the records from CDC's public APIs, ranks causes nationally and by state, and tracks how a cause's rank changed over time, followed by the caveats that every analyst must internalize before drawing conclusions: coding changes, single-cause reduction, small-count instability, and provisional recent years.
What the dataset is
The leading-causes-of-death dataset is the annual ranking, for each state and the nation, of the causes that account for the most deaths. It is produced by the National Center for Health Statistics (NCHS), the CDC component that serves as the federal government's principal health-statistics agency, from the same raw material behind all of its mortality products: the death certificates filed in the National Vital Statistics System (NVSS). The output is deliberately compact. Rather than the full multi-cause detail of the underlying mortality files, the leading-causes record gives, per state and year, the top causes—heart disease, cancer, unintentional injury, chronic lower respiratory disease, stroke, Alzheimer's disease, diabetes, and the remainder of the top ten—each with a death count and an age-adjusted rate. It is the picture that appears in the headline “leading causes of death” tables and the data behind the familiar ranking the CDC publishes each year.
In our database this record is stored as the table cdc_leading_causes, with the grain of one row per state per cause per year: roughly 10,900 rows keyed by state, cause, and year. A single state contributes one row for each rankable cause in each year, so the same cause appears once per state per year and the table is, in effect, a long-format ranking that can be pivoted by any of its three keys. The columns capture the geography, the year, the cause, its rank, the count, and the age-adjusted rate that makes the counts comparable across populations of different ages:
state -- state name (or "United States" for the national row)
year -- calendar year of the deaths
cause_name -- short label from the NCHS rankable-cause list
icd10_113_group -- the NCHS 113-cause list grouping the cause maps to
rank -- the cause's rank within the state-year (1 = most deaths)
deaths -- number of deaths assigned to this cause
age_adjusted_death_rate -- deaths per 100,000, standardized to the 2000 US population
crude_rate -- deaths per 100,000 of the actual population, unadjusted
population -- the denominator population for the state-year
The cause_name is the column that gives the table its meaning, and it is not free text: it is drawn from a fixed, official list. Each death certificate carries one underlying cause coded to the International Classification of Diseases, Tenth Revision (ICD-10), and NCHS maps the thousands of ICD-10 codes into a much shorter list of rankable causes so that the same cause categories are used consistently across every state and year. The deaths column is the raw count and drives the rank; the age_adjusted_death_rate is the column that makes cross-state and cross-year comparison legitimate. The rank column is a convenience (it can be recomputed from the counts within each state-year), but it encodes the headline fact the dataset exists to deliver: not how many died of a cause, but where that cause stands in the ordering of how a population dies.
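Because the rank can be recomputed from the counts within each state-year, a minimal pandas sketch of that recomputation looks like the following. The rows are made-up illustrative counts, the helper name add_rank is hypothetical, and the tie rule (tied causes share the lower rank number) is an assumption rather than a documented NCHS convention.

```python
import pandas as pd

# Hypothetical rows in the shape of cdc_leading_causes (counts are made up).
rows = pd.DataFrame({
    "state": ["Ohio"] * 3 + ["Utah"] * 3,
    "year": [2017] * 6,
    "cause_name": ["Heart disease", "Cancer", "Stroke"] * 2,
    "deaths": [300, 250, 60, 40, 45, 10],
})

def add_rank(df):
    # Rank causes within each state-year by death count (1 = most deaths).
    # method="min" gives tied causes the same rank; that tie rule is an assumption.
    out = df.copy()
    out["rank"] = (out.groupby(["state", "year"])["deaths"]
                      .rank(method="min", ascending=False)
                      .astype(int))
    return out

ranked = add_rank(rows)
print(ranked.sort_values(["state", "rank"]))
```

In this toy data cancer ranks first in the second state even though heart disease leads in the first, which is exactly the kind of local reordering the rank column exists to expose.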
The NVSS, ICD-10, and the single underlying cause
The dataset rests on the National Vital Statistics System, the decentralized arrangement through which the federal government compiles the country's births and deaths. Vital registration in the United States is a state function: a death certificate is completed locally—a physician, medical examiner, or coroner certifies the cause, a funeral director records the demographic facts—and registered with the state's vital-records office. The states then provide the records to NCHS, which standardizes, codes, and aggregates them into the national mortality statistics. NVSS is the closest thing the country has to a complete census of death: it aims to capture every death that occurs in the United States, which is what makes its products counts rather than the survey estimates that characterize much of federal health data.
The intellectual core of the system is the assignment of a single underlying cause to each death. A death certificate's medical section is not a single field—the certifier lists the immediate cause, then the conditions leading to it, then the underlying condition that began the sequence, plus other significant contributing conditions. NCHS applies the World Health Organization's coding rules, embodied in ICD-10 and in NCHS's automated coding software, to that chain to select the one underlying cause of death: the disease or injury that initiated the train of events leading directly to death. This is the central simplification of mortality statistics. A person who dies of pneumonia brought on by complications of diabetes is, for ranking purposes, a diabetes death; the pneumonia and any other listed conditions are contributing causes that do not enter the leading-causes ranking. The leading-causes dataset is therefore a tabulation of underlying causes only—a coherent, internationally comparable basis for ranking, but one that, by design, attributes each death to exactly one cause and discards the rest.
ICD-10 has governed US underlying-cause coding since 1999. The transition from the previous revision is the single most important discontinuity in the modern mortality record: because the rules for selecting and grouping causes changed, death counts and rates for some causes are not directly comparable across the 1999 boundary without the comparability ratios NCHS publishes for exactly that purpose. For a dataset whose entire point is comparison over time and across places, this is not a footnote—it is a constraint that shapes which trends can be read off the data and which require care.
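As a rough illustration of how a comparability ratio is used, the sketch below scales pre-1999 figures onto the ICD-10 basis by multiplying them by the ratio for that cause. Every number here is invented for the example, including the ratio itself; real ratios are cause-specific and must be taken from the NCHS comparability tables.

```python
def comparability_modified(icd9_value, ratio):
    # Scale a pre-1999 (ICD-9) count or rate onto the ICD-10 basis.
    return icd9_value * ratio

# Hypothetical ratio for one cause; NOT a published NCHS value.
EXAMPLE_RATIO = 1.05

# Made-up rates per 100,000 for a single cause.
pre_1999 = {1997: 200.0, 1998: 204.0}   # coded under ICD-9
post_1999 = {1999: 215.0, 2000: 213.0}  # coded under ICD-10

# Build one series on a common basis before plotting or differencing.
series = {yr: comparability_modified(v, EXAMPLE_RATIO) for yr, v in pre_1999.items()}
series.update(post_1999)

for yr in sorted(series):
    print(yr, round(series[yr], 1))
```

Without the adjustment, the jump from 204.0 to 215.0 would read as a sharp one-year increase; with it, most of that jump is attributed to the coding change.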
The NCHS list of rankable causes
A ranking is only as meaningful as the categories it ranks, and the leading-causes dataset uses a specific, official grouping rather than the raw ICD-10 codes. NCHS maintains a list of rankable causes of death—a curated subset of its standard cause lists chosen so that the categories are mutually exclusive, clinically meaningful, and stable enough to rank year over year. The ranking is conducted from the agency's 113-cause list (and a separate infant list for deaths under one year), but not every entry on that list is eligible to be ranked: residual “all other” categories and certain non-specific groupings are excluded so that they cannot crowd out the substantive causes. The result is the familiar roster: diseases of heart, malignant neoplasms (cancer), accidents (unintentional injuries), chronic lower respiratory diseases, cerebrovascular diseases (stroke), Alzheimer's disease, diabetes mellitus, and the others that round out the top ten.
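The effect of excluding residual categories can be seen in a small sketch. The labels in NON_RANKABLE and the counts are hypothetical placeholders; the authoritative rankable list is the one NCHS documents, and the helper rank_rankable is illustrative only.

```python
import pandas as pd

# Hypothetical labels for entries that should not compete in the ranking.
# The authoritative rankable list is defined in NCHS documentation.
NON_RANKABLE = {"All causes", "All other diseases (residual)"}

def rank_rankable(counts, excluded=NON_RANKABLE):
    # Drop totals and residual groupings, then order what remains by count.
    keep = counts[~counts.index.isin(excluded)]
    return keep.sort_values(ascending=False)

# Made-up counts: the residual bucket is larger than any single real cause.
counts = pd.Series({
    "All causes": 1000,
    "Heart disease": 230,
    "Cancer": 215,
    "All other diseases (residual)": 240,
    "Stroke": 55,
})

top = rank_rankable(counts)
print(top)
```

Left in, the residual bucket would outrank heart disease in this toy data, which is the crowding-out the rankable list is designed to prevent.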
Two features of this list matter for analysis. First, the categories vary in granularity in ways that affect their rank. “Accidents” is a broad container that aggregates mechanisms as different as motor-vehicle crashes, falls, and drug poisonings—which is why a surge in drug overdoses pushes the whole unintentional-injury category up the ranking, and why the leading-causes view, by collapsing those mechanisms together, has to be paired with the cause-specific injury data to see what is actually driving the movement. Second, the boundaries of a category can shift as the classification evolves: causes can be added to the rankable list, split, or recombined, and a cause that newly clears the threshold for inclusion can appear to “rise” in part because of a classification decision rather than a purely epidemiological one. The list is the lens; understanding its construction is prerequisite to reading the ranking it produces.
Why age-adjustment makes the rankings comparable
The most important methodological concept in this dataset, and the one most often misread, is age-adjustment. The problem it solves is simple but decisive: death is overwhelmingly a function of age. An older population will have more deaths, and more deaths from age-associated diseases like heart disease, cancer, and Alzheimer's, than a younger one—not because it is less healthy in any meaningful sense, but because more of its members are old. A state like Florida, with a relatively old age structure, will post higher crude death rates than a state like Utah, with a young one, for reasons that have nothing to do with the quality of its health system or environment. Comparing crude rates across states therefore largely measures the difference in their age structures, not the difference in their underlying mortality risk.
Age-adjustment removes that confound. The technique computes mortality rates within narrow age bands and then re-weights them to a common standard population—for US mortality statistics, the year-2000 US standard population—so that every state and every year is evaluated as if it had the same age distribution. The age-adjusted death rate answers the counterfactual question: if this state had the standard population's age structure, what would its death rate from this cause be? Because every jurisdiction is adjusted to the same standard, an older state is no longer penalized for being old, and a younger state is no longer flattered for being young; the rates become directly comparable, and differences in them reflect differences in risk rather than demography. This is why the dataset carries both a crude rate and an age-adjusted rate, and why almost all legitimate cross-state and cross-year comparison—and the analyses below—use the age-adjusted figure. The counts and the rank still tell you how a population actually dies; the age-adjusted rate tells you how dangerous a cause is once age is held constant. Conflating the two is the most common error in reading mortality data.
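The re-weighting described above is direct standardization, and a minimal sketch makes the mechanics concrete. The three age bands and their weights are deliberately coarse and invented for illustration; they are not the official year-2000 standard population weights, and the two populations are hypothetical. Both are given identical age-specific death rates, so any gap in their crude rates comes from age structure alone.

```python
# Illustrative standard-population weights (must sum to 1). These are NOT the
# official 2000 US standard weights, which use finer age bands.
STD_WEIGHTS = {"0-44": 0.65, "45-64": 0.22, "65+": 0.13}

def crude_rate(deaths, pop):
    # Deaths per 100,000 of the actual population.
    return 100_000 * sum(deaths.values()) / sum(pop.values())

def age_adjusted_rate(deaths, pop, weights=STD_WEIGHTS):
    # Weighted sum of age-specific rates, using the standard population's weights.
    return 100_000 * sum(weights[b] * deaths[b] / pop[b] for b in weights)

# Two hypothetical states with the same age-specific rates
# (10, 200, and 1,500 per 100,000) but different age structures.
older_pop = {"0-44": 400_000, "45-64": 300_000, "65+": 300_000}
older_deaths = {"0-44": 40, "45-64": 600, "65+": 4_500}
younger_pop = {"0-44": 700_000, "45-64": 200_000, "65+": 100_000}
younger_deaths = {"0-44": 70, "45-64": 400, "65+": 1_500}

for name, d, p in [("older", older_deaths, older_pop),
                   ("younger", younger_deaths, younger_pop)]:
    print(name, round(crude_rate(d, p), 1), round(age_adjusted_rate(d, p), 1))
```

The older state's crude rate is more than double the younger state's, yet their age-adjusted rates coincide, because the underlying risk at every age is the same.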
The top ten and how the ranking has shifted
For most of the modern era the top of the ranking has been remarkably stable: at the national level, heart disease and cancer have long held the first two positions, together accounting for a large share of all US deaths, with unintentional injury (accidents), chronic lower respiratory disease, and stroke rotating through the next several. Beneath that stable summit, however, the ranking has moved in ways that map directly onto the major public-health stories of recent decades, and tracking those movements is one of the dataset's most powerful uses.
Two long-run shifts stand out. Alzheimer's disease has climbed steadily up the ranking as the population has aged and as the condition has been recognized and certified on death certificates more consistently. That rise is partly real demographic and epidemiological change and partly improved attribution, the two woven together in a way the leading-causes view alone cannot fully separate. The other is unintentional injury, driven up the ranking by the drug-overdose epidemic: as overdose deaths surged, the broad accidents category climbed to become, in many years, the third-leading cause nationally and the leading cause of death for younger adults. Because the leading-causes dataset buckets all unintentional mechanisms together, the overdose signal shows up here as a rising accidents rank, which is precisely why this top-level view must be read alongside the cause-specific injury, overdose, and suicide data to understand what is moving and why. The COVID-19 pandemic produced the most dramatic single disruption of all: COVID-19 entered the ranking near the top in its peak years, reshuffling the order before receding, an episode the excess-deaths data quantifies in a complementary way. Reading these shifts in rank and in age-adjusted rate, cause by cause and state by state, is how the dataset turns a static table into a chronicle of changing mortality.
How the ranking differs across states
Because the dataset is resolved to the state level, it exposes the substantial geographic variation in how Americans die—variation that the national ranking averages away. The top two causes are nearly universal: heart disease and cancer lead in essentially every state, a reflection of the dominance of chronic disease across the whole country. But the order of the causes below the top, and the age-adjusted rates at which each cause kills, differ sharply by state, and those differences are where the analytic interest lies.
The patterns are not random. Age-adjusted death rates for the major chronic causes tend to run highest in a band of southern and Appalachian states—a geography of higher rates of heart disease, cancer, and chronic lower respiratory disease that correlates with smoking prevalence, obesity, poverty, and access to care—and lowest in a set of states with healthier behavioral profiles and stronger health systems. Specific causes carry their own geography: chronic lower respiratory disease, tied to smoking history, peaks in particular states; the overdose-driven climb of unintentional injury hit some states far earlier and harder than others as the epidemic spread; and Alzheimer's rates reflect both the local age structure and state-to-state differences in how the condition is certified. Reading the dataset by state—ranking jurisdictions by the age-adjusted rate for a chosen cause, or comparing each state's top-ten ordering against the national one—is what turns the abstract national ranking into a map of where each cause concentrates, and it is the natural starting point for asking why.
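Comparing one state's ordering against the national one can be sketched as a rank difference per cause. The frame below uses invented counts and a placeholder state name ("State A"); the function name rank_gap and the column names are assumptions modeled on the table layout described earlier.

```python
import pandas as pd

def rank_gap(df, state, year, national_label="United States"):
    # National rank minus state rank per cause:
    # a positive gap means the cause ranks higher in the state than nationally.
    yr = df[df["year"] == year]

    def order(label):
        sub = yr[yr["state"] == label].set_index("cause_name")["deaths"]
        return sub.rank(method="min", ascending=False)

    out = pd.DataFrame({"national_rank": order(national_label),
                        "state_rank": order(state)}).dropna()
    out["gap"] = out["national_rank"] - out["state_rank"]
    return out.sort_values("gap", ascending=False)

# Made-up counts for the national row and one hypothetical state.
causes = ["Heart disease", "Cancer", "Accidents", "CLRD", "Stroke"]
data = pd.DataFrame({
    "state": ["United States"] * 5 + ["State A"] * 5,
    "year": [2017] * 10,
    "cause_name": causes * 2,
    "deaths": [600, 590, 170, 160, 146, 48, 47, 15, 17, 11],
})

gap = rank_gap(data, "State A", 2017)
print(gap)
```

Here chronic lower respiratory disease sits one place higher in the hypothetical state than nationally, the kind of local elevation worth following up with the age-adjusted rate.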
The top-level view above the cause-specific files
The leading-causes dataset is best understood as the apex of a layered family of CDC mortality products, all built from the same NVSS death certificates but cut at different levels of detail. The leading-causes table is the most aggregated—the ranking of broad cause categories by state and year. Beneath it sit the cause-specific files that decompose particular categories into the mechanisms and intents the ranking hides.
The relationship is most visible in the injury categories. When unintentional injury rises in the leading-causes ranking, the question of what is driving it—motor-vehicle crashes, falls, or drug poisonings—is answered only by the injury-mortality data, which classifies external-cause deaths by mechanism and intent. Suicide, which appears in the top ten in some years and states, is a single line in the leading-causes ranking but an entire dataset's worth of detail—by method, demographic, and decade—in the suicide-mortality record. And where the leading-causes view shows what people died of, the excess-deaths measure shows how many more died than a historical baseline would predict, regardless of cause—the complementary lens that captured the full mortality impact of the COVID-19 pandemic, including deaths that never got coded to the disease itself. Read together, the leading-causes table tells you the ranking, the injury and suicide files tell you the composition inside the categories that move, and the excess-deaths measure tells you the total burden. The top-level dataset orients; the cause-specific ones explain.
Analytical uses
A complete, state-resolved, age-adjusted ranking of causes of death supports a distinctive set of analyses that no single cause-specific file can deliver, because its strength is the comparison across causes, places, and years.
Ranking causes nationally and by state is the most immediate use: producing, for any year, the ordered top-ten list for the nation or any state, and comparing a state's ordering against the national one to see which causes are elevated or suppressed locally. Tracking rank changes over time turns the static ranking into a trend instrument—watching a cause climb or fall across years exposes the overdose epidemic in the accidents line, the long ascent of Alzheimer's, and the pandemic spike of COVID-19, each as a movement in rank that can be cross-checked against the age-adjusted rate to distinguish a real change in risk from a change in the size of the population at risk.
Comparing age-adjusted rates across states is where the dataset is most rigorous: because the rates are standardized to a common population, an analyst can legitimately rank states by their mortality from a chosen cause and map the geographic concentration of risk—the southern and Appalachian elevation of chronic disease, the uneven spread of the overdose crisis—in a way crude rates would corrupt. Finally, framing the cause-specific files uses the leading-causes view as the index to the rest of the mortality library: it identifies which categories are large or moving and therefore worth decomposing, directing attention to the injury, suicide, and excess-death datasets where the mechanisms behind the ranking live.
Python workflow: ranking causes from the CDC APIs
The script below pulls the NCHS leading-causes-of-death records from CDC's public Socrata API on data.cdc.gov, then computes three of the core analyses: the national top-ten ranking for a chosen year (by death count), the states with the highest age-adjusted rate for a chosen cause, and how a cause's national rank shifted across the years in the data. No API key is required for modest volumes. Because NCHS field names and dataset identifiers vary between releases, the script resolves column names defensively and isolates the dataset id in one place; any production use should be validated against the current data.cdc.gov catalog, and CDC WONDER's underlying-cause query system is the route to custom cross-tabulations the published resource does not pre-compute.
import requests
import pandas as pd

# CDC leading-causes-of-death is published two ways:
#   1. data.cdc.gov (Socrata) -- a tidy NCHS resource with one row per
#      state x cause x year, carrying the death count and the
#      age-adjusted rate per 100,000.
#   2. CDC WONDER -- the underlying-cause-of-death query system, which
#      returns the same ICD-10 deaths grouped into the NCHS list of
#      rankable causes for custom cross-tabulations.
# This script works from the Socrata API: no key for modest volumes, it
# returns JSON, and it ships the pre-computed age-adjusted rates.
SODA = "https://data.cdc.gov/resource"

# The 4x4 Socrata dataset id changes across NCHS releases; isolate it
# here and confirm it against the current data.cdc.gov catalog. bi63-dtpu
# is the NCHS "NCHS - Leading Causes of Death: United States" resource.
LCOD_DATASET = "bi63-dtpu"


def fetch(dataset, where=None, limit=50000):
    # Socrata accepts SoQL query parameters ($where, $limit).
    params = {"$limit": limit}
    if where:
        params["$where"] = where
    url = f"{SODA}/{dataset}.json"
    r = requests.get(url, params=params, timeout=120)
    r.raise_for_status()
    return pd.DataFrame(r.json())


def _col(df, *names):
    # Resolve the first matching column name actually present. NCHS field
    # names vary by release (cause_name vs leading_cause, etc.).
    for n in names:
        if n in df.columns:
            return n
    return None


def load():
    df = fetch(LCOD_DATASET)
    if df.empty:
        return df
    df["_year"] = pd.to_numeric(df.get("year"), errors="coerce")
    df["_deaths"] = pd.to_numeric(df.get("deaths"), errors="coerce")
    rate_col = _col(df, "aadr", "age_adjusted_death_rate", "rate")
    # Leave the rate as NaN when no known rate column is present.
    df["_aadr"] = (pd.to_numeric(df[rate_col], errors="coerce")
                   if rate_col else float("nan"))
    return df


# --- 1. Rank causes nationally for one year ---------------------------
# The resource carries a "United States" geography row alongside the
# states; use it for the national ranking by death count.
def national_rank(year=2017):
    df = load()
    state_col = _col(df, "state", "geography", "jurisdiction")
    cause_col = _col(df, "cause_name", "leading_cause", "cause")
    nat = df[(df[state_col].astype(str).str.lower() == "united states")
             & (df["_year"] == year)
             & (~df[cause_col].astype(str).str.contains("All caus",
                                                        case=False, na=False))]
    ranked = nat.groupby(cause_col)["_deaths"].sum().sort_values(ascending=False)
    print(f"Leading causes of death, United States, {year}:")
    for i, (cause, n) in enumerate(ranked.head(10).items(), start=1):
        print(f"  {i:>2}. {str(cause)[:34]:<34} {int(n):>9,}")
    return ranked


# --- 2. Highest age-adjusted rate by state for one cause --------------
def states_by_rate(cause="Heart disease", year=2017):
    df = load()
    state_col = _col(df, "state", "geography", "jurisdiction")
    cause_col = _col(df, "cause_name", "leading_cause", "cause")
    sub = df[(df["_year"] == year)
             & (df[cause_col].astype(str).str.contains(cause, case=False,
                                                       na=False))
             & (df[state_col].astype(str).str.lower() != "united states")]
    ranked = sub.groupby(state_col)["_aadr"].mean().sort_values(ascending=False)
    print(f"\nHighest age-adjusted {cause} rate per 100,000, {year}:")
    for state, rate in ranked.head(10).items():
        print(f"  {str(state)[:20]:<20} {rate:6.1f}")
    return ranked


# --- 3. Track how a cause's national rank shifted over time -----------
def rank_over_time(cause="Alzheimer"):
    df = load()
    state_col = _col(df, "state", "geography", "jurisdiction")
    cause_col = _col(df, "cause_name", "leading_cause", "cause")
    nat = df[(df[state_col].astype(str).str.lower() == "united states")
             & (~df[cause_col].astype(str).str.contains("All caus",
                                                        case=False, na=False))]
    print(f"\nNational rank of {cause} by year:")
    for yr in sorted(nat["_year"].dropna().unique()):
        yr_df = nat[nat["_year"] == yr]
        order = yr_df.groupby(cause_col)["_deaths"].sum().sort_values(
            ascending=False).index.tolist()
        hit = [c for c in order if cause.lower() in str(c).lower()]
        if hit:
            print(f"  {int(yr)}  rank {order.index(hit[0]) + 1}")


national_rank(2017)
states_by_rate("Heart disease", 2017)
rank_over_time("Alzheimer")
Two practical notes apply. First, the national ranking in the script ranks by raw death count, which is the correct basis for the leading-causes ordering itself—the ranking is, by definition, about how many people a cause actually kills—while the cross-state comparison ranks by age-adjusted rate, which is the correct basis for comparing the danger of a cause across populations of different ages. Using the wrong measure for either question is the single most common analytic mistake with this data, and the two functions are deliberately split to keep them straight. Second, for trend work that crosses the 1999 ICD-10 boundary, the raw counts are not directly comparable across the transition; NCHS's comparability ratios must be applied before a pre-1999 and post-1999 rate are placed on the same axis, and the published resource generally covers the ICD-10 era for exactly this reason.
Limitations and analytical caveats
The leading-causes dataset is the authoritative top-level account of mortality in the United States, but it carries structural limitations that an analyst must internalize before drawing conclusions from it.
Every death is reduced to one cause. The single underlying-cause rule is what makes the ranking coherent, but it also means the dataset cannot see comorbidity. A death certified to heart disease in a person who also had diabetes and chronic kidney disease counts only against heart disease; the contributing conditions vanish from the ranking. For diseases that more often appear as contributing than as underlying causes—diabetes is the classic example—the leading-causes ranking systematically understates their true role in mortality. Multiple-cause-of-death analysis, available in the underlying NVSS files but not in this ranking, is the only way to recover that fuller picture.
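A toy tally shows how far the two views can diverge. The four certificates below are invented, and the record layout (one "underlying" field plus a "contributing" list) is a simplification assumed for the example, not the structure of the actual NVSS multiple-cause files.

```python
from collections import Counter

# Hypothetical certificates: one underlying cause plus contributing conditions.
certificates = [
    {"underlying": "Heart disease", "contributing": ["Diabetes", "Chronic kidney disease"]},
    {"underlying": "Diabetes", "contributing": ["Pneumonia"]},
    {"underlying": "Cancer", "contributing": ["Diabetes"]},
    {"underlying": "Heart disease", "contributing": []},
]

# Underlying-cause view: each death counts once, against one cause.
underlying = Counter(c["underlying"] for c in certificates)

# Multiple-cause view: each death counts once per distinct condition mentioned.
any_mention = Counter()
for c in certificates:
    for cause in {c["underlying"], *c["contributing"]}:
        any_mention[cause] += 1

print("underlying:", dict(underlying))
print("any mention:", dict(any_mention))
```

Diabetes is the underlying cause on one certificate but is mentioned on three, which is the understatement the underlying-cause ranking cannot show.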
Coding changes and certification practices shape the ranking. The 1999 ICD-10 transition broke direct comparability for some causes, and within the ICD-10 era the way conditions are certified evolves: improved recognition and more consistent certification of Alzheimer's disease, for instance, lift its counts for reasons that are partly artifactual. A rise or fall in a cause's rank can therefore reflect a change in how deaths are coded and certified rather than a change in how often people die of the underlying condition. Movements in the ranking should be interpreted with the classification history in view, not as if the categories were fixed.
Small counts make small-state and rare-cause rankings unstable. In a low-population state, the causes ranked seventh through tenth may be separated by only a handful of deaths, so a cause's rank can swing from year to year on statistical noise alone, and age-adjusted rates built on small numbers of deaths carry wide confidence intervals. NCHS suppresses or flags rates based on too few deaths for precisely this reason. Reading a one-place change in a small state's ranking as a real shift, or comparing rates that rest on tiny counts without their margins of error, over-reads what the data can bear.
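One rough way to gauge that instability is to treat the death count as a Poisson variable, so that the relative standard error of a rate is approximately 1 divided by the square root of the number of deaths. The sketch below applies that approximation; the helper name, the normal-approximation interval, and the 20-death reliability cutoff are all assumptions for illustration, and the suppression rules of the specific release should be checked before relying on any threshold. The normal interval is itself poor at very small counts and is shown only to convey the scale of the uncertainty.

```python
import math

def rate_interval(rate, deaths, z=1.96, min_deaths=20):
    # Approximate uncertainty for a death rate, assuming Poisson-distributed counts.
    # min_deaths is an assumed reliability cutoff, not taken from the dataset itself.
    if deaths <= 0:
        raise ValueError("deaths must be positive")
    se = rate / math.sqrt(deaths)
    return {
        "rse_pct": 100 / math.sqrt(deaths),  # relative standard error, percent
        "low": max(0.0, rate - z * se),
        "high": rate + z * se,
        "unreliable": deaths < min_deaths,
    }

# The same rate of 50 per 100,000 built on 16 deaths versus 400 deaths.
small = rate_interval(50.0, 16)
large = rate_interval(50.0, 400)
print(small)
print(large)
```

With 16 deaths the interval spans roughly half the rate in either direction, while with 400 deaths it narrows to about a tenth, so two small-state rates that look different can easily be statistically indistinguishable.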
Recent years are provisional, and the data is not real-time. Mortality data is finalized only after a substantial lag—deaths must be registered, coded, and processed—so the most recent year or two in any release is provisional and subject to revision as late certificates arrive and coding is completed. Provisional counts for recent periods are systematically incomplete and will rise. The leading-causes dataset is authoritative for established patterns and multi-year trends; it is not a current monitor of this year's deaths.
Held with these caveats in mind, the cdc_leading_causes table is a uniquely valuable resource: a state-resolved, age-adjusted, year-by-year ranking of the causes that kill the most Americans—the top-level map that tells you what to look at, sitting above the cause-specific mortality files that tell you why, and together composing the most complete federal account there is of how the country dies.
Related writing
CDC Injury Mortality: The Federal Record of How Americans Die from Firearms, Overdoses, and Crashes — When unintentional injury climbs the leading-causes ranking, this is the dataset that decomposes the broad accidents category into the mechanisms and intents—crashes, falls, firearms, and the drug overdoses—that the top-level view buckets together.
CDC Suicide Mortality: The Federal Record of a Public-Health Crisis Over Seven Decades — Suicide is a single line in the leading-causes ranking but an entire dataset's worth of detail—by method, demographic, and decade—in the cause-specific suicide record built from the same death certificates.
CDC Excess Deaths: The Federal Measure of How Many More Americans Died Than Expected — Where the leading-causes table shows what people died of, the excess-deaths measure shows how many more died than a historical baseline predicts—the complementary lens that captured the full mortality burden of the COVID-19 pandemic, including deaths never coded to the disease.