
CDC Nutrition, Physical Activity, and Obesity: The Federal Surveillance Record of American Health Behavior

AI Analytics

Every year the federal government calls roughly four hundred thousand Americans on the telephone and asks them, among hundreds of other questions, how tall they are, how much they weigh, how often they exercise, and how many vegetables they eat. The CDC Nutrition, Physical Activity, and Obesity dataset is the structured, state-by-state distillation of those answers — the single most comprehensive federal record of how American health behavior varies across geography, income, race, and education, and the empirical backbone of the national conversation about obesity.

The dataset is published by the CDC Division of Nutrition, Physical Activity, and Obesity — universally abbreviated DNPAO — and it is built almost entirely from a single underlying instrument: the Behavioral Risk Factor Surveillance System, or BRFSS. In the table we catalog as cdc_npao_states, there are roughly 109,180 rows. Each row is one estimate: a prevalence percentage for one health behavior, in one state or territory, in one year, for one demographic slice of the population. Stacked together, those rows trace the contours of the American obesity epidemic with a level of demographic and geographic detail that no other federal source matches. This is the data behind the maps you have seen turn from light blue to dark red over three decades, and behind the state obesity programs, nutrition assistance debates, and built-environment policies that those maps helped to justify.

What the dataset is, and the survey behind it

BRFSS is, by a wide margin, the largest continuously conducted health survey in the world. It began in 1984 with fifteen states and has run every year since, expanding to cover all fifty states, the District of Columbia, and several US territories. Today it collects more than 400,000 completed interviews annually. It is a genuinely federal-and-state partnership: the CDC designs the core questionnaire and provides the methodological backbone, but the actual interviewing is conducted by state health departments (or their contractors), who can append state-specific modules to the national core. The result is a survey that is nationally standardized in its core questions yet locally administered, which is precisely what makes state-to-state comparison meaningful.

BRFSS is a telephone survey of non-institutionalized adults aged eighteen and older. Since the early 2010s it has used both landline and cellular telephone samples, with disproportionate stratified sampling on the landline frame and a separate random sample of cell-phone numbers. Respondents are selected to be representative of the adult population within each state, and the raw responses are weighted to match the state's demographic distribution by age, sex, race, ethnicity, education, marital status, home ownership, and phone type. That weighting is what allows a few thousand interviews per state to stand in for millions of residents.
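
The effect of weighting can be seen in a minimal sketch. The respondent records and weights below are invented for illustration, and `weighted_prevalence` is a hypothetical helper rather than any BRFSS routine; the point is only that a weighted percentage can differ sharply from a raw headcount.

```python
def weighted_prevalence(respondents):
    """Weighted percentage of respondents who have the outcome.

    Each respondent is a (has_outcome, weight) pair, where the weight is
    the number of adults in the population that respondent stands in for.
    """
    total_weight = sum(weight for _, weight in respondents)
    if total_weight == 0:
        raise ValueError("weights sum to zero")
    outcome_weight = sum(weight for has_outcome, weight in respondents if has_outcome)
    return 100.0 * outcome_weight / total_weight


# Invented sample: four respondents, two of whom report the outcome.
sample = [(True, 1000.0), (False, 3000.0), (True, 500.0), (False, 500.0)]

unweighted = 100.0 * sum(1 for has_outcome, _ in sample if has_outcome) / len(sample)
weighted = weighted_prevalence(sample)
print(f"unweighted: {unweighted:.1f}%  weighted: {weighted:.1f}%")
```

In this toy sample half of the respondents report the outcome, but they represent less than a third of the weighted population, so the weighted figure is the lower one.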

The DNPAO dataset is a curated extract of BRFSS focused specifically on the topics DNPAO cares about: body weight, physical activity, and diet. Rather than publish the raw respondent-level microdata, DNPAO publishes pre-computed prevalence estimates — the weighted percentage of adults in each population cell who answer a given question a given way, together with the statistical uncertainty around that percentage. This is the form the data takes in our catalog and on the CDC's open-data portal, and it is the form most analysts actually want: ready-made, comparable, demographically stratified prevalence figures.

The anatomy of a single row is worth dwelling on, because understanding it is the key to using the dataset correctly. Each record carries a year (the data are organized by survey year, with most indicators reported annually); a location, which is one of the fifty states, the District of Columbia, a US territory, or a national rollup; a question or topic, which identifies the specific behavior being measured; a demographic stratification, which tells you whether the row is for all adults or for a particular subgroup; the data value itself, a percentage; the lower and upper 95 percent confidence limits around that percentage; and the sample size on which the estimate is based. The stratification is the dimension that multiplies the row count: for each state and each question, there is not one number but a whole family of numbers — one overall, then breakdowns by age group, by sex, by race and ethnicity, by household income bracket, and by education level. Multiply states by years by questions by stratifications and the roughly 109,180 rows are the natural result.
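
As a way of fixing the row structure in mind, the fields described above can be sketched as a small record type. The class and attribute names here are hypothetical and chosen for readability; they are not the column names used on the CDC portal, and the example values are invented.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PrevalenceEstimate:
    """One row: one indicator, one place, one year, one population slice."""
    year: int
    location: str                  # state, DC, territory, or national rollup
    question: str                  # the behavior being measured
    stratification_category: str   # e.g. "Total", "Income", "Education"
    stratification: str            # the specific subgroup within the category
    value_pct: Optional[float]     # prevalence in percent (None if suppressed)
    ci_low: Optional[float]        # lower 95 percent confidence limit
    ci_high: Optional[float]       # upper 95 percent confidence limit
    sample_size: Optional[int]     # respondents behind the estimate

    def ci_width(self) -> Optional[float]:
        """Width of the confidence interval, or None if a limit is missing."""
        if self.ci_low is None or self.ci_high is None:
            return None
        return self.ci_high - self.ci_low


# Invented values, for illustration only.
example = PrevalenceEstimate(
    year=2022,
    location="Example State",
    question="Percent of adults aged 18 years and older who have obesity",
    stratification_category="Total",
    stratification="Total",
    value_pct=33.0,
    ci_low=31.5,
    ci_high=34.5,
    sample_size=5000,
)
print(example.ci_width())
```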

The DNPAO topics and the obesity epidemic

The question field is what gives the dataset its substance. DNPAO tracks a focused set of indicators, each defined precisely enough to be reproducible year over year. On body weight, the headline indicators are the percentage of adults who have obesity (a body mass index of 30.0 or higher) and the percentage of adults who have overweight (a BMI of 25.0 to 29.9). On physical activity, the dataset reports the percentage of adults who meet the federal aerobic activity guideline, the percentage who meet the muscle-strengthening guideline, the percentage who meet both, and the percentage who engage in no leisure-time physical activity at all. On diet, it tracks fruit consumption and vegetable consumption, typically expressed as the percentage of adults who consume fruit, or vegetables, less than one time per day — a deliberately low bar that nonetheless a large share of the population fails to clear.
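
The two weight indicators rest on a simple calculation, sketched below. The function names and the "neither" label are illustrative choices, not DNPAO terminology; the factor 703 is the standard conversion for pounds and inches, and the cutoffs are the ones stated above, with the overweight band treated as 25.0 up to but not including 30.0.

```python
def bmi_from_imperial(weight_lb, height_in):
    """Body mass index from weight in pounds and height in inches."""
    if height_in <= 0:
        raise ValueError("height must be positive")
    return 703.0 * weight_lb / (height_in ** 2)


def classify(bmi):
    """Map a BMI value onto the two DNPAO weight categories."""
    if bmi >= 30.0:
        return "obesity"
    if bmi >= 25.0:
        return "overweight"
    return "neither"


# A hypothetical respondent reporting 200 lb and 5 ft 10 in (70 inches).
value = bmi_from_imperial(200, 70)
print(f"BMI {value:.1f} -> {classify(value)}")
```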

The obesity indicator is the one that has reshaped American public health discourse. When BRFSS began producing state obesity maps, no state had an adult obesity prevalence above roughly fifteen percent. By 2010, no state was below twenty percent. By the late 2010s, a growing number of individual states had crossed thirty-five percent in the self-reported BRFSS data, while the national rate measured by physical examination in the National Health and Nutrition Examination Survey (NHANES) had crossed forty percent. The trajectory from 1990 to 2020 is, in the DNPAO data, a remarkably smooth and remorseless upward climb: a public health trend with almost no reversals at the national level, interrupted only by the methodological discontinuity we will come to shortly.

Layered on top of that temporal trend is a stark and stable geographic gradient. Year after year, the leanest states cluster in the Mountain West and on the coasts: Colorado has sat at or near the bottom of the obesity ranking for essentially the entire history of the series, joined typically by Hawaii, Massachusetts, California, and much of the interior West. The heaviest states cluster in the Deep South and Appalachia: Mississippi, West Virginia, Alabama, Louisiana, Arkansas, and their neighbors routinely occupy the top of the ranking. The gap between the leanest and heaviest states is not small. In a typical recent year it spans well over ten percentage points, meaning that an adult's probability of having obesity differs by roughly a third or more, in relative terms, depending simply on which state they live in. That gradient is one of the most durable findings in all of American population health, and it is DNPAO that documents it.

Self-reported BMI and the under-reporting problem

Here is the single most important caveat for anyone using this dataset, and it deserves its own section: BRFSS obesity figures are based on self-reported height and weight. The interviewer asks respondents how tall they are and how much they weigh, computes BMI from those answers, and classifies the respondent accordingly. No one is measured. And it is a well-established, repeatedly replicated finding that people misreport their bodies in a systematic direction — respondents tend to overstate their height and understate their weight, both of which push the computed BMI downward and both of which therefore make the population look leaner than it is.
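
A short piece of arithmetic shows why both misreports pull in the same direction. The respondent below is invented, and the sizes of the height and weight errors are assumptions chosen only to make the effect visible; they are not estimates of the actual BRFSS reporting bias.

```python
def bmi(weight_kg, height_m):
    """Body mass index: weight in kilograms over height in meters squared."""
    return weight_kg / height_m ** 2


# Invented respondent, for illustration only.
measured_height_m, measured_weight_kg = 1.70, 87.0
reported_height_m, reported_weight_kg = 1.72, 85.0   # taller and lighter

true_bmi = bmi(measured_weight_kg, measured_height_m)
height_only = bmi(measured_weight_kg, reported_height_m)   # only height misreported
weight_only = bmi(reported_weight_kg, measured_height_m)   # only weight misreported
reported_bmi = bmi(reported_weight_kg, reported_height_m)  # both misreported

for label, value in [
    ("measured", true_bmi),
    ("height overstated", height_only),
    ("weight understated", weight_only),
    ("both", reported_bmi),
]:
    print(f"{label:<20}{value:6.2f}")
```

Each error lowers the computed BMI on its own, and together they move this hypothetical respondent from just above the obesity cutoff of 30.0 to below it.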

The benchmark against which this bias is measured is the National Health and Nutrition Examination Survey, or NHANES, in which trained technicians physically measure each participant's height and weight in a mobile examination center. NHANES is the gold standard for body composition precisely because it does not rely on self-report. And the gap between the two surveys is large and consistent: NHANES, with measured values, produces national adult obesity estimates several percentage points higher than the self-reported BRFSS figures. The implication is that the true national obesity prevalence is meaningfully higher than the already-alarming BRFSS number, and that the DNPAO series should be read as a conservative lower bound on the epidemic's scale.

This does not make the DNPAO data less useful; it makes it useful for different things. BRFSS cannot give you the precise national prevalence; NHANES does that. What BRFSS gives you, and NHANES cannot, is the state-by-state and demographically stratified picture. NHANES is a national sample with no state-level estimates, and its examination logistics make a fifty-state design impractical. So the two surveys are complements: NHANES anchors the absolute level, BRFSS supplies the geographic and subgroup detail. As long as the reporting bias is roughly constant across states and over time (an assumption that is reasonable but not perfect), the BRFSS rankings and trends remain valid even though the absolute levels are understated.

The demographic disparities

The stratification fields turn DNPAO from a map of states into a map of inequality. Because every indicator is broken out by race and ethnicity, income, education, age, and sex, the dataset lets you see not just where obesity is high but among whom — and the disparities it reveals are some of the most consequential in American health.

By race and ethnicity, adult obesity prevalence is consistently highest among non-Hispanic Black adults and Hispanic adults, lower among non-Hispanic White adults, and lowest among non-Hispanic Asian adults, who in the BRFSS data report obesity prevalence far below every other group. These patterns hold across most states and most years, though their magnitude varies regionally. The disparity is not uniform across sex, either: the racial gap in obesity is generally wider among women than among men, a pattern that recurs throughout the chronic-disease literature.

The income and education gradients are equally pronounced and run in the direction one would expect from the broader social-determinants-of-health literature, though with some nuance. Adult obesity prevalence generally declines as household income rises and as educational attainment increases: adults in the lowest income brackets and those without a high school diploma report substantially higher obesity prevalence than adults in the highest income brackets and those with a college degree. The education gradient in particular tends to be steep and consistent — college graduates report markedly lower obesity than adults who did not finish high school. These gradients are the empirical fuel for the argument that obesity is not merely a matter of individual choice but is patterned by economic and educational opportunity, by the affordability and availability of healthy food, and by the time and infrastructure available for physical activity.

The same stratification structure applies to the physical-activity and diet indicators, and the disparities there largely mirror the obesity disparities — meeting the federal activity guidelines is more common among higher-income and more-educated adults, and adequate fruit and vegetable consumption follows the same socioeconomic slope. The internal consistency across indicators is itself reassuring evidence that the survey is capturing real population differences rather than artifacts of any single question.

Physical activity, diet, and the federal guidelines

The DNPAO physical-activity and nutrition indicators are not free-floating numbers; they are deliberately constructed to map onto specific federal recommendations, which is what makes them actionable. The aerobic-activity indicator measures the share of adults who meet the aerobic component of the Physical Activity Guidelines for Americans — broadly, at least 150 minutes per week of moderate-intensity activity, or 75 minutes of vigorous activity, or an equivalent combination. The muscle-strengthening indicator measures the share who meet the guideline of strengthening activities on two or more days per week. The combined indicator captures the smaller share who meet both, which is the full guideline as written. And the no-leisure-time-physical-activity indicator captures the opposite pole: adults who report no physical activity outside of work at all, a population at especially elevated chronic-disease risk.
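
The logic of the three guideline indicators can be sketched in a few lines. This is a simplification under stated assumptions: the function names are hypothetical, the inputs are weekly totals rather than the individual survey responses BRFSS actually collects, and the "equivalent combination" is implemented by counting each vigorous minute as two moderate minutes.

```python
def meets_aerobic(moderate_min_per_week, vigorous_min_per_week):
    """Aerobic guideline: 150 moderate-equivalent minutes per week.

    One vigorous minute is counted as two moderate minutes, so 75 vigorous
    minutes alone, or any equivalent mix, also satisfies the threshold.
    """
    return moderate_min_per_week + 2 * vigorous_min_per_week >= 150


def meets_strengthening(days_per_week):
    """Muscle-strengthening guideline: two or more days per week."""
    return days_per_week >= 2


def meets_both(moderate_min_per_week, vigorous_min_per_week, strength_days):
    """The full guideline: aerobic and muscle-strengthening together."""
    return meets_aerobic(moderate_min_per_week, vigorous_min_per_week) and (
        meets_strengthening(strength_days)
    )


# Hypothetical adult: 90 moderate minutes, 30 vigorous minutes, 1 strength day.
print(meets_aerobic(90, 30))        # aerobic component
print(meets_strengthening(1))       # strengthening component
print(meets_both(90, 30, 1))        # combined indicator
```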

On the diet side, the fruit and vegetable indicators connect to federal dietary guidance emphasizing fruit and vegetable intake as a cornerstone of a healthy diet. Because BRFSS cannot measure precise dietary intake over the phone the way a detailed dietary recall would, the indicators use a coarse but reliable frequency measure — how often a person eats fruit or vegetables — and report the share falling below a daily threshold. The coarseness is a limitation, but it has the virtue of being answerable accurately in a telephone interview, which a gram-level intake question is not. The mapping of every indicator to an explicit federal guideline is what lets a state health officer translate a DNPAO number directly into a policy target.

How this data supports policy

DNPAO data is not collected for academic interest; it is collected to drive intervention, and it does so through several concrete channels. The most direct is the CDC's own state obesity and chronic-disease programs. CDC funds state health departments through cooperative agreements to implement nutrition and physical-activity interventions, and DNPAO surveillance data is how those programs are targeted and evaluated. A state cannot demonstrate that an obesity-prevention program is working without a consistent year-over-year prevalence measure, and DNPAO is that measure. The same data populates CDC's public-facing tools, including the long-running adult obesity prevalence maps that have become a fixture of public health communication.

The data also informs federal nutrition-assistance policy. The Supplemental Nutrition Assistance Program (SNAP) and the Special Supplemental Nutrition Program for Women, Infants, and Children (WIC) both intersect with obesity and diet quality, and DNPAO's income-stratified obesity and nutrition indicators are part of the evidence base for arguments about how those programs should be structured — whether they adequately support healthy eating among low-income households, and how their reach correlates with the geography of diet-related disease. The strong income gradient in the DNPAO data is frequently cited in debates over nutrition-assistance design.

Finally, DNPAO data underpins the built-environment strand of obesity policy. The recognition that physical activity depends on the physical environment — on the availability of sidewalks, parks, safe routes to school, recreational facilities, and access to grocery stores rather than only convenience stores — is supported by the geographic patterning that DNPAO documents. When activity rates are low and obesity high in a region, the data motivates investment in the infrastructure that makes active living possible, and it provides the baseline against which such investments can later be assessed.

What you can do with it

The dataset rewards several distinct kinds of analysis. The most immediate is the state ranking: filter to the adult-obesity question, to the total (all-adults) stratification, and to a single year, and you can order all fifty states from leanest to heaviest, reproducing from the published estimates the maps that drive public health attention. Because every estimate carries confidence limits, you can do this responsibly, flagging cases where two states' intervals overlap so heavily that their rank difference is not statistically meaningful.
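
One simple way to flag uncertain rank differences is to check whether each state's interval overlaps that of the state ranked just below it. The sketch here uses invented estimates for placeholder states, and the helper names are hypothetical. Interval overlap is a conservative screen rather than a formal significance test, so it should be read as a prompt for caution, not a verdict.

```python
def intervals_overlap(a, b):
    """True if two closed intervals (low, high) share any point."""
    return a[0] <= b[1] and b[0] <= a[1]


def rank_with_overlap_flags(estimates):
    """Rank estimates ascending and flag overlap with the previous rank.

    `estimates` is a list of (name, value, ci_low, ci_high) tuples.
    Returns a list of (rank, name, value, overlaps_previous) tuples.
    """
    ranked = sorted(estimates, key=lambda e: e[1])
    result = []
    for i, (name, value, low, high) in enumerate(ranked):
        overlaps_previous = i > 0 and intervals_overlap(
            (ranked[i - 1][2], ranked[i - 1][3]), (low, high)
        )
        result.append((i + 1, name, value, overlaps_previous))
    return result


# Invented estimates for three placeholder states.
estimates = [
    ("State C", 38.0, 36.5, 39.5),
    ("State A", 25.0, 23.8, 26.2),
    ("State B", 26.0, 24.9, 27.1),
]
ranking = rank_with_overlap_flags(estimates)
for rank, name, value, tied in ranking:
    note = "  (interval overlaps previous rank)" if tied else ""
    print(f"{rank}. {name}: {value:.1f}%{note}")
```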

The second is trend analysis. Because the dataset spans decades, you can pull a single indicator for a single state across all available years and plot its trajectory — the slow climb of obesity, the stagnation or improvement in activity rates — subject to the 2011 methodological break discussed below, which you must treat as a discontinuity rather than a real change. The third is disparity analysis: hold the state and year fixed, iterate over the stratification subgroups, and quantify the gap between, say, the lowest and highest income brackets, or between racial and ethnic groups, within a single jurisdiction.
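
A minimal sketch of break-aware trend handling follows. The series values are invented, and `split_at_break` and `change_within` are hypothetical helpers; the only fact taken from the survey itself is that 2011 is the first year under the revised methodology, so changes are computed within each segment and never across the boundary.

```python
BREAK_YEAR = 2011  # first survey year under the revised BRFSS methodology


def split_at_break(series, break_year=BREAK_YEAR):
    """Split a {year: value} series into pre-break and post-break segments."""
    pre = {year: value for year, value in series.items() if year < break_year}
    post = {year: value for year, value in series.items() if year >= break_year}
    return pre, post


def change_within(segment):
    """Change from first to last year of one segment, in percentage points."""
    if len(segment) < 2:
        return None
    first_year, last_year = min(segment), max(segment)
    return segment[last_year] - segment[first_year]


# Invented prevalence series for one indicator in one placeholder state.
series = {2008: 24.0, 2009: 24.5, 2010: 25.0, 2011: 26.5, 2012: 27.0, 2013: 27.5}
pre, post = split_at_break(series)
print("pre-break change: ", change_within(pre))
print("post-break change:", change_within(post))
```

In this invented series the jump between 2010 and 2011 is deliberately excluded from both figures, since it cannot be separated from the change in method.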

The fourth, and analytically the richest, is correlation with other federal datasets. DNPAO prevalence figures are at the state and (in companion datasets) county level, which means they can be joined to food-access data such as the USDA Food Access Research Atlas, to chronic-disease outcome data such as diabetes and heart-disease prevalence, and to socioeconomic data from the Census. Correlating obesity prevalence against food-desert measures, or against downstream chronic-disease rates, turns DNPAO from a descriptive snapshot into an input for genuine epidemiological inquiry. Because the geographic keys are standard FIPS-style state and territory identifiers, these joins are mechanically straightforward.
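
The mechanics of such a join can be sketched with nothing more than dictionaries keyed on a shared geographic code. Everything below is an assumption made for illustration: the two-digit keys are placeholders rather than real FIPS assignments, the values are invented, and the second measure stands in for whatever companion dataset is being joined.

```python
from math import sqrt


def inner_join(left, right):
    """Pair up values for keys present in both {code: value} mappings."""
    return {code: (left[code], right[code]) for code in left if code in right}


def pearson(pairs):
    """Pearson correlation coefficient for a list of (x, y) pairs."""
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two pairs")
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


# Invented values keyed by placeholder two-digit codes.
obesity_pct = {"91": 25.0, "92": 30.0, "93": 35.0, "94": 40.0}
other_measure = {"91": 10.0, "92": 12.0, "93": 14.0, "95": 20.0}

joined = inner_join(obesity_pct, other_measure)
r = pearson(list(joined.values()))
print(f"{len(joined)} matched locations, r = {r:.2f}")
```

Codes that appear in only one source drop out of the inner join, which is worth checking explicitly when territories or the national rollup are present in one dataset but not the other.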

A worked example in Python

DNPAO is surfaced through the CDC's Socrata open-data platform at chronicdata.cdc.gov, which exposes a standard JSON API supporting the Socrata Query Language (SoQL) for server-side filtering, selection, and pagination. The script below pulls the DNPAO state dataset, filters to the adult-obesity indicator for the most recent available year, ranks the states from leanest to heaviest, and then computes the national income and education gradients. It is deliberately written to lean on server-side filtering so that it transfers only the rows it needs rather than the full hundred-thousand-row table.

Two field-level notes before the code. First, DNPAO carries both a human-readable question string and a stable questionid code; matching on the identifier is more robust than matching on prose, because the wording of a question can be edited over time while its identifier stays fixed. Second, the demographic breakouts live in a pair of fields — a stratificationcategory1 that names the dimension (Total, Age, Sex, Race/Ethnicity, Income, Education) and a stratification1 that names the specific value within it — with the income and education values also surfaced in their own named columns. Selecting the Total category is how you get the all-adults figure rather than a subgroup.

import requests
import pandas as pd

# ---------------------------------------------------------------
# CDC Nutrition, Physical Activity, and Obesity (DNPAO)
# Source: BRFSS, surfaced via the CDC Socrata open-data platform
#
# This script:
#   1. Pulls the DNPAO "by state" dataset (chronicdata.cdc.gov)
#   2. Filters to the adult-obesity question for the latest year
#   3. Ranks states from leanest to heaviest
#   4. Computes the income and education gradients
# ---------------------------------------------------------------

# DNPAO national/state/territory dataset on the CDC chronic-disease
# Socrata domain. The four-by-four ("hn4x-zwk7") resource id is the
# long-running "Nutrition, Physical Activity, and Obesity - BRFSS"
# table; confirm the current id from the dataset landing page.
ENDPOINT = "https://chronicdata.cdc.gov/resource/hn4x-zwk7.json"

# The adult-obesity indicator. DNPAO carries the human-readable text
# in the "question" field; matching on the QuestionID is more stable.
OBESITY_QID = "Q036"   # "Percent of adults aged 18 years and older who have obesity"
LATEST_YEAR = "2022"


def fetch(params: dict, page_size: int = 5000) -> list[dict]:
    """Paginate through a CDC Socrata endpoint."""
    rows: list[dict] = []
    offset = 0
    while True:
        paged = {**params, "$limit": page_size, "$offset": offset}
        resp = requests.get(ENDPOINT, params=paged, timeout=60)
        resp.raise_for_status()
        batch = resp.json()
        if not batch:
            break
        rows.extend(batch)
        if len(batch) < page_size:
            break
        offset += page_size
    return rows


def to_float(val):
    try:
        return float(val)
    except (TypeError, ValueError):
        return None


# ---------------------------------------------------------------
# Step 1: State ranking, overall (total) adult obesity prevalence
# ---------------------------------------------------------------
# stratificationcategory1 == "Total" selects the all-adults value
# for each location, rather than a demographic subgroup.
state_rows = fetch({
    "$select": "locationdesc, data_value, low_confidence_limit, "
               "high_confidence_limit, sample_size",
    "$where": (
        f"questionid = '{OBESITY_QID}' "
        f"AND yearstart = '{LATEST_YEAR}' "
        "AND stratificationcategory1 = 'Total' "
        "AND data_value IS NOT NULL "
        # exclude the national rollup, the territories, and DC so that
        # only the fifty states remain in the ranking
        "AND locationabbr NOT IN ('US','PR','GU','VI','DC')"
    ),
    "$order": "data_value ASC",
})

states = pd.DataFrame(state_rows)
for c in ["data_value", "low_confidence_limit",
          "high_confidence_limit", "sample_size"]:
    states[c] = states[c].apply(to_float)
states = states.dropna(subset=["data_value"]).reset_index(drop=True)
states.index += 1

print(f"Adult Obesity Prevalence by State, {LATEST_YEAR} (BRFSS / DNPAO)")
print("-" * 70)
print(f"{'Rank':<6}{'State':<24}{'Obesity %':>10}{'95% CI':>18}{'n':>10}")
print("-" * 70)
for rank, row in states.iterrows():
    ci = f"{row['low_confidence_limit']:.1f}-{row['high_confidence_limit']:.1f}"
    n = int(row["sample_size"]) if pd.notna(row["sample_size"]) else 0
    print(
        f"{rank:<6}{row['locationdesc']:<24}"
        f"{row['data_value']:>10.1f}{ci:>18}{n:>10,}"
    )

leanest = states.iloc[0]
heaviest = states.iloc[-1]
print()
print(f"Leanest:  {leanest['locationdesc']} ({leanest['data_value']:.1f}%)")
print(f"Heaviest: {heaviest['locationdesc']} ({heaviest['data_value']:.1f}%)")
print(f"Spread:   {heaviest['data_value'] - leanest['data_value']:.1f} pts")
print(f"National median across states: {states['data_value'].median():.1f}%")


# ---------------------------------------------------------------
# Step 2: The income gradient (national rollup, US)
# ---------------------------------------------------------------
income_rows = fetch({
    "$select": "income, data_value, low_confidence_limit, high_confidence_limit",
    "$where": (
        f"questionid = '{OBESITY_QID}' "
        f"AND yearstart = '{LATEST_YEAR}' "
        "AND locationabbr = 'US' "
        "AND stratificationcategory1 = 'Income' "
        "AND data_value IS NOT NULL"
    ),
})

income = pd.DataFrame(income_rows)
income["data_value"] = income["data_value"].apply(to_float)
# Order the brackets from lowest to highest income for a clean gradient
income_order = [
    "Less than $15,000", "$15,000 - $24,999", "$25,000 - $34,999",
    "$35,000 - $49,999", "$50,000 - $74,999", "$75,000 or greater",
]
income["rank"] = income["income"].apply(
    lambda b: income_order.index(b) if b in income_order else 99
)
income = income.sort_values("rank").reset_index(drop=True)

print()
print(f"Adult Obesity by Household Income, US, {LATEST_YEAR}")
print("-" * 50)
print(f"{'Income bracket':<28}{'Obesity %':>12}")
print("-" * 50)
for _, row in income.iterrows():
    print(f"{row['income']:<28}{row['data_value']:>12.1f}")


# ---------------------------------------------------------------
# Step 3: The education gradient (national rollup, US)
# ---------------------------------------------------------------
edu_rows = fetch({
    "$select": "education, data_value",
    "$where": (
        f"questionid = '{OBESITY_QID}' "
        f"AND yearstart = '{LATEST_YEAR}' "
        "AND locationabbr = 'US' "
        "AND stratificationcategory1 = 'Education' "
        "AND data_value IS NOT NULL"
    ),
})

edu = pd.DataFrame(edu_rows)
edu["data_value"] = edu["data_value"].apply(to_float)
edu_order = [
    "Less than high school", "High school graduate",
    "Some college or technical school", "College graduate",
]
edu["rank"] = edu["education"].apply(
    lambda b: edu_order.index(b) if b in edu_order else 99
)
edu = edu.sort_values("rank").reset_index(drop=True)

print()
print(f"Adult Obesity by Education, US, {LATEST_YEAR}")
print("-" * 50)
print(f"{'Education level':<36}{'Obesity %':>12}")
print("-" * 50)
for _, row in edu.iterrows():
    print(f"{row['education']:<36}{row['data_value']:>12.1f}")

if len(edu) >= 2:
    span = edu["data_value"].iloc[0] - edu["data_value"].iloc[-1]
    print()
    print(
        "Education gradient (less-than-HS minus college grad): "
        f"{span:.1f} pts"
    )

Running this against the live endpoint produces a clean leanest-to-heaviest ranking in which Colorado and the Mountain West sit at the top and the Deep South states at the bottom, an income table in which obesity prevalence falls monotonically (or nearly so) as income rises, and an education table in which the gap between adults without a high school diploma and college graduates typically runs to double digits in percentage points. The exact resource identifier and the latest available survey year should be confirmed against the dataset's landing page before a production run, since the CDC periodically republishes DNPAO under refreshed dataset versions.

Caveats and limitations

Four limitations govern any responsible use of this dataset, and the first — the self-report bias in height and weight — has already been discussed. The DNPAO obesity and overweight figures understate true prevalence, and they should be read against measured NHANES values when an absolute level matters. The bias is generally treated as roughly constant across states and over time, which preserves rankings and trends, but it is a modeling assumption rather than a certainty.

The second, and the one most likely to trip up a trend analysis, is the 2011 BRFSS methodological change. In 2011 BRFSS made two simultaneous and consequential changes: it brought cellular telephone numbers fully into the sampling frame (correcting for the growing share of cell-phone-only households that landline-only sampling had been missing), and it adopted a new statistical weighting method known as iterative proportional fitting, or raking, in place of the older post-stratification approach. Both changes improved the survey's accuracy going forward, but together they introduced a discontinuity in the series. Estimates from 2011 onward are not directly comparable to estimates from 2010 and earlier, and the CDC explicitly warns against treating any change across that boundary as a real shift in the population. Any longitudinal analysis must either confine itself to 2011 and later years or model the boundary explicitly as a structural break in the series.
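
For readers unfamiliar with raking, the idea can be shown on a toy two-way table. This is a sketch under strong simplifying assumptions: it rakes cell totals on only two margins, whereas BRFSS adjusts respondent-level weights across many more dimensions, and the counts and population targets are invented.

```python
def rake(cells, row_targets, col_targets, max_iter=100, tol=1e-9):
    """Iterative proportional fitting on a two-way table.

    Alternately rescales rows and columns until the table's margins match
    the row and column targets. Returns a new table; `cells` is unchanged.
    """
    if abs(sum(row_targets) - sum(col_targets)) > 1e-9:
        raise ValueError("row and column targets must have the same total")
    table = [row[:] for row in cells]
    n_rows, n_cols = len(table), len(table[0])
    for _ in range(max_iter):
        # Step 1: scale each row so that it sums to its target.
        for i in range(n_rows):
            row_sum = sum(table[i])
            if row_sum > 0:
                factor = row_targets[i] / row_sum
                table[i] = [x * factor for x in table[i]]
        # Step 2: scale each column so that it sums to its target.
        for j in range(n_cols):
            col_sum = sum(table[i][j] for i in range(n_rows))
            if col_sum > 0:
                factor = col_targets[j] / col_sum
                for i in range(n_rows):
                    table[i][j] *= factor
        # Stop once the rows still match after the column adjustment.
        if all(abs(sum(table[i]) - row_targets[i]) <= tol for i in range(n_rows)):
            break
    return table


# Invented sample counts (rows: two age groups, columns: two sexes)
# and invented population margins with the same grand total.
sample_counts = [[30.0, 20.0], [10.0, 40.0]]
raked = rake(sample_counts, row_targets=[60.0, 40.0], col_targets=[50.0, 50.0])
for row in raked:
    print([round(x, 2) for x in row])
```

The appeal of the method is that it needs only the marginal population totals for each dimension, not the full cross-tabulation, which is what allows additional weighting variables to be added.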

The third is small-cell suppression and statistical instability. When a demographic subgroup within a state yields too few respondents, the resulting estimate is either suppressed entirely or carries confidence limits so wide as to be nearly uninformative. This is most acute for small racial and ethnic subgroups in small-population states, where a single year may simply not contain enough interviews to support a reliable subgroup estimate. The presence of confidence limits on every row is the dataset's honest acknowledgment of this; an analysis that ignores those limits and treats every point estimate as equally solid will draw false conclusions from noise. Always inspect the sample size and the width of the interval before trusting a subgroup figure.
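
A simple screening function makes that inspection routine. The thresholds below are illustrative assumptions, not CDC suppression rules, and the function name and labels are hypothetical; the appropriate cutoffs depend on the analysis at hand.

```python
MIN_SAMPLE_SIZE = 50      # illustrative threshold, not a CDC rule
MAX_CI_WIDTH_PTS = 20.0   # illustrative threshold, not a CDC rule


def reliability(value, ci_low, ci_high, sample_size):
    """Label an estimate as 'suppressed', 'unstable', or 'ok'."""
    if value is None or ci_low is None or ci_high is None:
        return "suppressed"
    if sample_size is None or sample_size < MIN_SAMPLE_SIZE:
        return "unstable"
    if ci_high - ci_low > MAX_CI_WIDTH_PTS:
        return "unstable"
    return "ok"


# Invented subgroup estimates: (value, ci_low, ci_high, sample_size).
rows = [
    (33.0, 31.5, 34.5, 5000),   # large sample, narrow interval
    (41.0, 22.0, 60.0, 120),    # interval far too wide to interpret
    (28.0, 20.0, 36.0, 35),     # too few respondents
    (None, None, None, None),   # suppressed by the publisher
]
labels = [reliability(*row) for row in rows]
print(labels)
```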

The fourth is the telephone-survey frame itself. BRFSS reaches only adults who can be contacted by telephone and who agree to be interviewed, and despite careful weighting, the non-institutionalized telephone-reachable population is not a perfect mirror of the whole adult population. Adults without any telephone, those in institutions such as nursing homes and prisons, and those who systematically decline to participate are underrepresented, and response rates to telephone health surveys have declined over the decades. Weighting corrects for known demographic skews but cannot correct for unmeasured differences between those who answer and those who do not. The DNPAO data is the best state-level behavioral surveillance the United States has, and it is genuinely excellent for that purpose — but it is a survey of who picks up the phone and what they are willing to say, and every number it contains should be read in that light.
