NAEP: The Nation's Report Card and the Federal Dataset Behind US Education Achievement
Every two years the federal government publishes a single set of test scores that no state can manipulate, no school district can opt out of, and no governor can redefine to look more flattering. That dataset is the National Assessment of Educational Progress (NAEP), and the 2022 release delivered the most alarming numbers in a generation: the largest reading score decline in thirty years and the largest mathematics decline since the assessments began, both widely attributed to pandemic-era disruption of schooling. NAEP is the only nationally representative, continuing measure of what American students actually know, and understanding its structure is essential for anyone who works with education data.
What NAEP Is and How It Came to Be
The National Assessment of Educational Progress was authorized by Congress and administered for the first time in 1969, making it the longest-running federal education survey in the United States. Congress formally reauthorized NAEP in 1988 under the Hawkins-Stafford Amendments and again in 1994 and 2001. The No Child Left Behind Act of 2001 was a turning point: it mandated biennial NAEP assessments in reading and mathematics at grades 4 and 8 for every state that receives Title I funding — which is all of them. That mandate is why state-level 4th and 8th grade reading and math scores are the core of the NAEP program and the source of most public attention.
NAEP is administered by the National Center for Education Statistics (NCES), the statistical agency of the U.S. Department of Education, operating under the Institute of Education Sciences (IES). The operational contract for test development and administration has historically been held by Educational Testing Service (ETS) and Westat. Governance sits with the National Assessment Governing Board (NAGB), an independent bipartisan body created by Congress in 1988 to set policy for NAEP — including the achievement level standards. This separation between the testing contractor, the federal statistical agency, and the independent governing board is deliberate: it insulates NAEP from political pressure to change standards or selectively release results.
The label “The Nation's Report Card” is the official public-facing brand for NAEP, hosted at nationsreportcard.gov. It is apt in one sense and misleading in another. NAEP is genuinely the only consistent, comparable yardstick that cuts across all fifty states and major urban districts. But unlike a school report card, NAEP scores are not fed back to individual students, teachers, or schools — they are population estimates for jurisdictions, not diagnostic tools for classrooms.
What NAEP Measures: Subjects, Grades, and Programs
NAEP assesses students across a wide range of subjects: Reading, Mathematics, Science, Writing, U.S. History, Geography, Civics, Technology and Engineering Literacy, and Arts. Not every subject is assessed every year or at every grade. The most frequent assessments — and the most policy-relevant — are Reading and Mathematics at grades 4 and 8, administered every other year.
Grade 12 assessments in reading and mathematics are conducted less frequently and draw much less policy attention, in part because grade 12 sampling is complicated by high rates of absenteeism and dropout, which can introduce selection bias.
Within the NAEP program there are three distinct assessment programs that researchers must distinguish:
- National NAEP: uses nationally representative samples designed to produce estimates for the country as a whole, by demographic subgroup, and by region. The national sample has existed since the first NAEP in 1969.
- State NAEP: uses state-representative samples designed to produce reliable estimates for each individual state. NCLB made state participation mandatory starting in 2003. Its defining value is that it enables state-to-state comparisons on a common scale.
- Trial Urban District Assessment (TUDA): an extension of State NAEP that oversamples students in 27 large urban school districts (including New York City, Chicago, Los Angeles, Houston, and others) to produce reliable district-level estimates. TUDA results allow direct comparisons between urban districts on the NAEP scale, which would otherwise be impossible because the state sample alone contains too few students from any one district to support reliable estimates.
Separate from the main NAEP program is the Long-Term Trend NAEP (LTT). LTT has assessed students at ages 9, 13, and 17 (rather than grades) in reading since 1971 and in mathematics since 1973, using a largely unchanged assessment format to preserve comparability across decades. LTT is the only federal data source that can answer the question “how does student achievement in 2024 compare to 1971?” The main NAEP assessments have undergone content revisions that prevent direct cross-decade comparisons on the main scale.
The NAEP Scale and Achievement Levels
NAEP reading and mathematics results at grades 4 and 8 are reported on a 0–500 scale (some other assessments, including science and grade 12 mathematics, use a 0–300 scale). The scale is designed to support comparisons across years within a subject and grade, but not across subjects or grades: a score of 240 in 4th grade math means something entirely different from 240 in 8th grade reading.
The National Assessment Governing Board sets three achievement levels for each subject-grade combination:
- Basic — denotes partial mastery of prerequisite knowledge and skills that are fundamental for proficient work at each grade. Students performing at Basic can demonstrate some competency but fall short of solid grade-level performance.
- Proficient — represents solid academic performance and competency over challenging subject matter. This is the policy target embedded in NCLB and subsequent federal education law. Students at or above Proficient are the benchmark used by policymakers and the press.
- Advanced — denotes superior performance beyond grade-level expectations.
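The relationship between a scale score and the three levels can be sketched as a simple lookup. This is a minimal illustration, not an NCES tool: the cut scores below are assumed to match the values NAGB publishes for main NAEP reading and mathematics and should be checked against the official tables, and `achievement_level` is a hypothetical helper name. It is meant for locating a group mean or a point on the scale, since NAEP does not report scores for individual students.

```python
from bisect import bisect_right

# (Basic, Proficient, Advanced) cut scores on the 0-500 scale.
# Assumption: these match the published NAGB values; verify before relying on them.
CUT_SCORES = {
    ("mathematics", 4): (214, 249, 282),
    ("mathematics", 8): (262, 299, 333),
    ("reading", 4): (208, 238, 268),
    ("reading", 8): (243, 281, 323),
}
LEVELS = ("Below Basic", "Basic", "Proficient", "Advanced")


def achievement_level(subject, grade, scale_score):
    """Return the achievement-level band that a scale score falls into."""
    cuts = CUT_SCORES[(subject, grade)]
    # bisect_right counts how many cut scores are at or below the score,
    # which is exactly the index of the band in LEVELS.
    return LEVELS[bisect_right(cuts, scale_score)]


# The same number lands in different bands for different subject-grade scales.
print(achievement_level("mathematics", 4, 240))
print(achievement_level("reading", 8, 240))
```

Under these assumed cut scores, 240 sits in the Basic band for 4th grade mathematics but below Basic for 8th grade reading, which is why scores are never compared across subjects or grades.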
The 2022 results made these definitions painfully concrete. Nationally, only about 33% of 4th graders scored at or above Proficient in reading, and roughly 36% reached Proficient in mathematics. In 8th grade, roughly 31% were at or above Proficient in reading and 26% in mathematics. These percentages — meaning that well under half of American students meet the federal Proficient standard in core subjects — are the numbers that drive education reform debates, state legislative hearings, and the perennial arguments about whether American schools are failing.
One persistent methodological critique of NAEP achievement levels is that NAGB sets them through a somewhat subjective process called the “modified Angoff” method, in which panels of educators and policy representatives estimate the probability that a minimally proficient student would answer each item correctly. The resulting cut scores have been criticized by the National Academy of Sciences as “fundamentally flawed” and are widely seen as demanding — more so than most state standards. The critique matters because the absolute percentage at Proficient depends entirely on where the cut score is set, and NAGB's cuts are arguably aspirational rather than descriptive of grade-level mastery as most teachers understand it.
COVID Learning Loss: The 2022 Results in Context
The 2022 NAEP results, released in October 2022, are now the central empirical foundation for every serious discussion of pandemic-era education impact in the United States. The headline findings were stark. Fourth grade reading scores fell 3 points nationally between 2019 and 2022, the largest decline in that measure in thirty years of State NAEP administration. Eighth grade mathematics scores fell 8 points, the largest decline recorded in that series since State NAEP began.
The declines were not uniform. Urban districts assessed through TUDA showed the steepest drops. Students in the lowest quartile of the score distribution suffered the largest absolute losses, meaning the pandemic both lowered average achievement and widened the gap between high- and low-performing students. Large cities that had among the most restrictive and prolonged school closure policies in 2020 and 2021 — including several major TUDA districts — showed disproportionate losses, consistent with the hypothesis that in-person instruction time is particularly important for students who depend on school as their primary academic support.
The magnitude of the declines was alarming in historical context. NAEP scores had been generally rising since the early 1990s. The 8-point drop in 8th grade math essentially erased two decades of gains. The 2024 administration showed uneven and incomplete recovery: 4th grade mathematics regained some ground, 8th grade mathematics was essentially flat, and reading scores slipped further at both grades. Research on educational disruptions from natural disasters and recessions suggests why recovery is slow: most students recover somewhat with sustained exposure to effective instruction, but a portion of learning loss persists if remediation is inadequate.
Achievement Gap Data: NAEP as the Federal Standard
NAEP is the primary federal source for tracking achievement gaps by race and ethnicity, income, disability status, and English learner status over time. Because the same instrument is administered consistently using population sampling, NAEP gap estimates are methodologically defensible in a way that comparisons between different states' own assessments are not.
The White–Black gap in 4th grade reading has been tracked since the 1992 State NAEP and stands at roughly 25 scale-score points. It narrowed modestly through the early 2000s as No Child Left Behind accountability pressure focused attention on subgroup performance, then stagnated, and widened post-COVID as Black students experienced larger average losses than White students. A 25-point gap on the NAEP scale corresponds to roughly two grade levels of learning, a sobering figure that has proved resistant to three decades of federal, state, and local policy interventions.
The income-based gap — measured by NAEP as eligibility for the National School Lunch Program, a proxy for family income below 185% of the federal poverty line — has been tracked since the 1990s and similarly hovers around 25–30 points. Post-COVID the income gap actually widened: students from lower-income families lost more ground on average than those from higher-income families, likely reflecting differential access to devices, broadband, tutoring, and adult supervision during remote learning periods.
English learner (EL) students and students with disabilities are included in NAEP samples with accommodations, though the accommodation policies have evolved over time in ways that introduce some comparability complications in trend analysis. NAEP reports both “all students” and “exclusion-adjusted” figures to support consistent longitudinal comparisons.
The State Comparison Problem: NAEP vs. State Assessments
One of NAEP's most important policy functions is exposing the gap between state-reported proficiency rates and NAEP-measured proficiency rates. Because states set their own proficiency standards on their own assessments, and because those standards have historically varied enormously, a state can report that 70% of its 4th graders are “proficient” in reading while NAEP shows only 30% reaching the NAEP Proficient benchmark. This discrepancy does not mean the state is lying; it means the state has set a lower bar than NAGB.
The most instructive case is what researchers call the “Mississippi paradox.” Mississippi historically had among the lowest NAEP scores in the country and also set relatively modest state proficiency standards. Beginning around 2013, Mississippi enacted significant literacy reforms — mandatory reading retention for students not reading at grade level by 3rd grade, intensive early literacy intervention, and curriculum improvements — while simultaneously raising its state proficiency standards to more closely align with NAEP. The result was that Mississippi's NAEP scores improved substantially over the following decade, while its state-reported proficiency rates fell (because the standard got harder). Mississippi went from near the bottom of the NAEP ranking to roughly the middle — a dramatic improvement that is invisible if you only look at changes in the state's own assessment results.
The inverse happened in other states: they kept state proficiency standards low, producing high state-reported proficiency rates while NAEP scores stagnated. NAEP is the only mechanism that cuts through this problem because every state takes the same test scored on the same scale.
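The comparison described in this section reduces to a subtraction per state. The sketch below uses entirely hypothetical figures (the first row mirrors the 70% versus 30% example above) and a hypothetical helper, `proficiency_gap`; real inputs would come from state-reported results and the published NAEP tables.

```python
# Hypothetical proficiency rates (percent of 4th graders, reading).
# None of these figures are real state results.
rates = [
    {"state": "State A", "state_reported": 70.0, "naep_proficient": 30.0},
    {"state": "State B", "state_reported": 45.0, "naep_proficient": 38.0},
    {"state": "State C", "state_reported": 36.0, "naep_proficient": 35.0},
]


def proficiency_gap(row):
    """Percentage-point gap between the state's own rate and the NAEP rate."""
    return row["state_reported"] - row["naep_proficient"]


# Largest gaps first: these are the states whose own bar sits furthest below NAGB's.
ranked = sorted(rates, key=proficiency_gap, reverse=True)
for row in ranked:
    print(f'{row["state"]}: gap = {proficiency_gap(row):.1f} points')
```

A large positive gap says nothing about honesty; it indicates only that the state's proficiency cut is lower than the NAEP Proficient cut.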
The Plausible Values Methodology
NAEP uses a design feature called matrix sampling that dramatically reduces burden on individual students but introduces a methodological complexity that researchers must handle correctly. No student answers all NAEP items. Instead, students receive booklets containing a subset of items from a larger item pool. This allows NAEP to cover a broad content domain without requiring any individual student to spend more than about ninety minutes testing.
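The idea behind matrix sampling can be illustrated with a toy booklet design. This is a simplified sketch under stated assumptions, not NAEP's operational design: the block count, the two-blocks-per-booklet rule, and the function names are all hypothetical.

```python
from itertools import combinations, cycle


def build_booklets(n_blocks, blocks_per_booklet=2):
    """List every booklet as a combination of item-block indices."""
    return list(combinations(range(n_blocks), blocks_per_booklet))


def assign_booklets(student_ids, booklets):
    """Hand out booklets in rotation so each booklet is used about equally often."""
    return dict(zip(student_ids, cycle(booklets)))


booklets = build_booklets(n_blocks=5)            # 10 booklets of 2 blocks each
assignment = assign_booklets(range(30), booklets)

# Each block appears in the same number of booklets, so every item is
# answered by a comparable share of students even though no student sees all items.
block_counts = {b: sum(b in bk for bk in booklets) for b in range(5)}
print(len(booklets), block_counts)
```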
The consequence is that individual student scores cannot be estimated directly from their limited item responses with adequate precision. Instead, NCES uses item response theory and a conditioning model that incorporates both item responses and background questionnaire data to estimate each student's posterior score distribution. From this distribution, five “plausible values” are drawn as multiple imputations representing the student's likely proficiency.
For researchers working with NAEP public-use or restricted-use microdata, this matters enormously. Correct analysis requires running every statistical model on all five plausible values separately, then averaging the parameter estimates and combining standard errors using Rubin's rules for multiple imputation. Analyzing only the first plausible value, or averaging across plausible values before analysis, produces biased estimates of subgroup differences and incorrect standard errors. The NAEP Primer (NCES Technical Report 2014-800) documents the methodology in full. Most published NAEP tables — including everything on nationsreportcard.gov — already account for plausible values correctly; the issue arises only when researchers work with microdata directly.
Because NAEP is a population assessment rather than a diagnostic instrument, plausible values are valid for estimating group means, subgroup gaps, and distributional statistics, but they cannot be used to make statements about individual students. This is not a limitation of the design — it is a deliberate feature that allows NAEP to achieve broad content coverage without the testing burden that high-stakes individual assessment would require.
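The combining step described above can be sketched in a few lines. This is a minimal illustration of Rubin's rules, assuming the per-plausible-value estimates and their sampling variances have already been computed (in practice NAEP sampling variances come from replicate weights, which are not shown here). The function name and the input numbers are hypothetical.

```python
import math
from statistics import mean, variance


def combine_plausible_values(estimates, sampling_variances):
    """Pool one estimate per plausible value using Rubin's rules.

    estimates: point estimate from each plausible value
    sampling_variances: squared standard error of each estimate
    """
    m = len(estimates)
    pooled = mean(estimates)                # final point estimate
    within = mean(sampling_variances)       # average sampling variance
    between = variance(estimates)           # variance across plausible values (m - 1 denominator)
    total = within + (1 + 1 / m) * between  # total variance
    return pooled, math.sqrt(total)


# Hypothetical subgroup means and standard errors from five plausible values
pv_means = [281.2, 280.7, 281.9, 281.0, 281.4]
pv_ses = [1.10, 1.12, 1.08, 1.11, 1.09]
est, se = combine_plausible_values(pv_means, [s ** 2 for s in pv_ses])
print(f"pooled mean = {est:.2f}, standard error = {se:.2f}")
```

The pooled standard error is larger than any single-plausible-value standard error because the between-imputation term carries the measurement uncertainty that a single plausible value hides.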
The NAEP Data Explorer API
NCES provides public access to NAEP results through the NAEP Data Explorer at nationsreportcard.gov/ndecore and through a REST API that the Data Explorer itself uses. The API endpoint for aggregate data is:
https://www.nationsreportcard.gov/Dataservice/GetAdhocData/variableList
Query parameters specify the subject (reading or mathematics), grade (4 or 8), subscale, variable (demographic breakdowns or total), jurisdiction (national, state, or TUDA district), statistic type (mean scale score, percentage at or above achievement levels), and year. The API returns JSON containing jurisdiction-level estimates and standard errors.
State-level results cover all 50 states plus the District of Columbia. TUDA district results are available for 27 urban districts. Results can be broken down by race and ethnicity, gender, eligibility for free or reduced-price lunch, disability status, English learner status, and school type (public vs. private in national samples).
The public Data Explorer also provides access to item-level statistics and released sample items, which are useful for understanding what the NAEP scale actually measures at different score levels. Restricted-use microdata — individual student-level item responses and background questionnaire data — are available to qualified researchers through an NCES restricted-use data license, subject to a data security plan and institutional review. The microdata are necessary for any analysis that goes beyond the published aggregate tables, including multilevel modeling, trend decomposition by school characteristics, or regression analysis of background-variable associations with achievement.
Python: Querying the NAEP Data Explorer API
The following script queries the NAEP Data Explorer API to retrieve state average scale scores for all four core subject-grade combinations (4th and 8th grade, reading and mathematics) from the 2022 assessment, then computes state rankings and identifies the states with the largest declines between 2019 and 2022 in 8th grade mathematics — the measure that showed the sharpest COVID-era deterioration nationally.
```python
import requests
import pandas as pd

# NAEP Data Explorer REST API
# Fetch state-level average scale scores for 4th and 8th grade
# reading and mathematics from the most recent main NAEP assessment
BASE = "https://www.nationsreportcard.gov/Dataservice/GetAdhocData/variableList"


def fetch_naep(subject, grade, year=2022):
    """Return a DataFrame of state average scale scores for one subject/grade."""
    params = {
        "type": "data",
        "subject": subject,        # "reading" or "mathematics"
        "grade": str(grade),       # "4" or "8"
        "subscale": "RRPCM" if subject == "reading" else "MRPCM",
        "variable": "TOTAL",
        "jurisdiction": "NT",      # NT = all jurisdictions including states
        "stattype": "MN:MN",       # mean scale score
        "Year": str(year),
    }
    resp = requests.get(BASE, params=params, timeout=60)
    resp.raise_for_status()
    data = resp.json()
    rows = data.get("result", [])
    df = pd.DataFrame(rows)
    df = df[["Jurisdiction", "Year", "Value", "SE"]].copy()
    df.columns = ["state", "year", "avg_score", "std_err"]
    df["subject"] = subject
    df["grade"] = grade
    # Convert score to float; suppress strings like "Insufficient data"
    df["avg_score"] = pd.to_numeric(df["avg_score"], errors="coerce")
    df["std_err"] = pd.to_numeric(df["std_err"], errors="coerce")
    return df.dropna(subset=["avg_score"])


# Pull all four subject-grade combinations
frames = []
for subj in ("reading", "mathematics"):
    for gr in (4, 8):
        frames.append(fetch_naep(subj, gr, year=2022))
results_2022 = pd.concat(frames, ignore_index=True)

# Rank states within each subject-grade combination
results_2022["rank"] = (
    results_2022.groupby(["subject", "grade"])["avg_score"]
    .rank(ascending=False, method="min")
    .astype(int)
)

# Show top 10 states for 8th grade math
g8_math = (
    results_2022[
        (results_2022["subject"] == "mathematics") & (results_2022["grade"] == 8)
    ]
    .sort_values("avg_score", ascending=False)
    .head(10)
)
print("Top 10 states -- 8th grade mathematics (2022 NAEP)")
print(g8_math[["rank", "state", "avg_score", "std_err"]].to_string(index=False))

# --- COVID learning-loss delta: compare 2019 vs 2022 ---
frames_2019 = []
for subj in ("reading", "mathematics"):
    for gr in (4, 8):
        frames_2019.append(fetch_naep(subj, gr, year=2019))
results_2019 = pd.concat(frames_2019, ignore_index=True)
results_2019 = results_2019.rename(columns={"avg_score": "avg_score_2019"})

merged = results_2022.merge(
    results_2019[["state", "subject", "grade", "avg_score_2019"]],
    on=["state", "subject", "grade"],
    how="inner",
)
merged["delta"] = merged["avg_score"] - merged["avg_score_2019"]

# States with the largest 8th grade math declines
worst = (
    merged[(merged["subject"] == "mathematics") & (merged["grade"] == 8)]
    .sort_values("delta")
    .head(10)
)
print("\nStates with largest 8th grade math decline, 2019 to 2022")
print(worst[["state", "avg_score_2019", "avg_score", "delta"]].to_string(index=False))
```
A few implementation notes. The NAEP Data Explorer API is a public endpoint but is not officially documented as a stable public API — it supports the nationsreportcard.gov interface and its parameters can change between assessment cycles. The subscale codes (“RRPCM” for reading, “MRPCM” for math) refer to the composite scale for the main NAEP assessments. The stattype parameter “MN:MN” requests the mean scale score. Responses include standard errors, which should be used to assess whether apparent state-to-state differences are statistically significant before drawing policy conclusions — small apparent differences between adjacent states often fall within sampling error.
The delta computation comparing 2019 to 2022 is the most policy-relevant output. States that appear at the top of the largest-decline list are those where the pandemic had the most measurable impact on 8th grade mathematics achievement. The pattern in actual NAEP data shows that several large urban-majority states experienced disproportionate losses, consistent with the longer closure periods in those jurisdictions.
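The point about sampling error in the implementation notes can be made concrete with a rough check on whether two reported means differ by more than chance. This sketch assumes the two samples are independent and uses a conventional two-sided 5% threshold; it does not apply the multiple-comparison adjustments that matter when many states are compared at once. The function name and the example scores are hypothetical.

```python
import math


def difference_test(score_a, se_a, score_b, se_b, z_crit=1.96):
    """Compare two independent estimates using their standard errors.

    Returns the difference, its standard error, and whether |z| exceeds z_crit.
    """
    diff = score_a - score_b
    se_diff = math.sqrt(se_a ** 2 + se_b ** 2)  # independent samples: variances add
    z = diff / se_diff
    return diff, se_diff, abs(z) > z_crit


# Hypothetical adjacent states in a ranking: 2 points apart, SEs near 1
diff, se_diff, significant = difference_test(284.0, 1.1, 282.0, 1.2)
print(f"difference = {diff:.1f}, SE = {se_diff:.2f}, significant = {significant}")
```

With standard errors around one point each, a two-point difference between neighbouring states does not clear the threshold, which is why adjacent ranks should rarely be read as real differences.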
Connecting NAEP to the Broader Federal Education Data Ecosystem
NAEP measures achievement but does not explain it. Understanding what drives NAEP score patterns requires linking NAEP results to other federal education datasets:
- Common Core of Data (CCD) — the NCES census of all public elementary and secondary schools in the United States, providing enrollment, poverty rates, racial composition, and school characteristics. NAEP jurisdiction-level demographic data can be interpreted against CCD school-level context for a given state.
- IPEDS (Integrated Postsecondary Education Data System) — covers college enrollment, completion, and outcomes. NAEP 8th and 12th grade scores for a cohort can be compared to the IPEDS college-going rates several years later to study the pipeline from K–12 achievement to postsecondary outcomes.
- EDFacts — a collection of administrative data submitted by state education agencies to the Department of Education, including state assessment results, graduation rates, and chronic absenteeism. The EDFacts data allows direct comparison of state-reported proficiency rates to NAEP proficiency rates at the state level — the “honesty gap” analysis.
- BEA GDP and personal income accounts — state-level economic output provides the resource and fiscal context for education investment. States with higher per-capita income tend to have higher NAEP scores, though the relationship is far from deterministic: Mississippi's NAEP gains occurred without being among the highest-spending states.
Limitations and Appropriate Uses
NAEP is powerful but bounded. It measures a sample of students in a given grade in a given year — it does not follow cohorts longitudinally. Apparent trends between NAEP assessment years reflect changes in the population of students at that grade level, not necessarily learning growth or decline in specific students. Researchers interested in student-level growth need longitudinal administrative data from states, not NAEP.
NAEP scores cannot be disaggregated below the state or TUDA district level using public data. There are no NAEP scores for individual schools, districts (outside TUDA), or counties. This is a fundamental constraint of the sampling design: the sample size within a given state is typically 2,000–3,000 students, large enough for reliable state estimates but far too small for sub-state geography.
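A back-of-the-envelope calculation shows why a sample sized for a state cannot support sub-state estimates. The sketch below uses the textbook approximation for the standard error of a mean under a complex sample. The score standard deviation and design effect are assumed round numbers chosen for illustration, not published NAEP parameters, and `approx_standard_error` is a hypothetical helper.

```python
import math


def approx_standard_error(sd, n, design_effect=1.0):
    """Approximate SE of a mean: sd * sqrt(design_effect / n)."""
    return sd * math.sqrt(design_effect / n)


ASSUMED_SD = 35.0      # assumed spread of scale scores (illustrative)
ASSUMED_DEFF = 2.0     # assumed design effect from clustering students in schools

# A state-sized sample, a mid-sized district's share of it, and a small county's share
for n in (2500, 250, 50):
    se = approx_standard_error(ASSUMED_SD, n, ASSUMED_DEFF)
    print(f"n = {n:>4}: approximate standard error = {se:.1f} points")
```

Under these assumptions the standard error grows from about one point at the state level to several points for a slice of a few dozen students, which is larger than most of the differences anyone would want to detect.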
Finally, NAEP measures what it measures: knowledge and skill in specific academic domains as defined by NAGB framework committees. The frameworks are regularly updated and reflect considered expert judgment about what students should know, but they are not the only legitimate definition of educational success. Social-emotional learning, creative problem-solving, applied technical skill, and civic participation are not captured on the NAEP scale. Policymakers who optimize entirely for NAEP scores risk crowding out other educational objectives that matter.
NAEP tracks K–12 achievement; the federal dataset covering what happens after high school is IPEDS, the Integrated Postsecondary Education Data System, which reports college enrollment, completion, and institutional finances for every Title IV institution.
For county-level economic context around education investment and labor market returns, the Census Bureau's County Business Patterns provides annual establishment counts and payroll by industry down to the county level. See Census County Business Patterns: Establishment and Payroll Data by County and Industry.
State education spending and its relationship to economic output can be situated using BEA GDP accounts, which report state-level gross domestic product and personal income annually. See BEA GDP Accounts: State and National Economic Output from the Bureau of Economic Analysis.