Technical writing

Every US traffic death since 1975: using NHTSA FARS to analyze road safety, vehicle defects, and enforcement gaps

13 min read · AI Analytics
Regulatory data · NHTSA · FARS · Traffic safety · Vehicle safety · Transportation

Since 1975, the National Highway Traffic Safety Administration has maintained a structured record of every person who died in a motor vehicle crash on a United States public road. The Fatality Analysis Reporting System — FARS — is not a survey or a sample. It is a census. More than 1.1 million fatality records, each linked to a crash, a vehicle, and a person, each coded against a standardized schema that has been extended and refined across five decades. The dataset is public, freely downloadable, and almost entirely unread outside of academia and NHTSA's own research division.

What FARS is and what it covers

FARS was established by Congress under the Highway Safety Act and has been operational since January 1, 1975. Every state, the District of Columbia, and Puerto Rico submit fatality data to NHTSA. State highway safety offices collect crash reports from law enforcement agencies, coroner and medical examiner records, emergency medical services data, and vehicle registration files, then code that information into the FARS data elements and transmit it to NHTSA. The federal agency performs consistency checks, resolves discrepancies with the states, and publishes an annual data release covering each calendar year.

The definition of a FARS case is precise: a crash must involve at least one motor vehicle traveling on a public trafficway, and at least one person must die within 30 days of the crash as a result of injuries sustained in it. That 30-day window is the fatality threshold. A pedestrian struck in a private parking lot is not a FARS case. A cyclist killed on a park trail is not a FARS case. A driver who dies 31 days after a crash from injuries sustained in it is not counted. These boundaries matter for interpretation, and they create the dataset's most significant limitation: FARS excludes by definition all road deaths that occur off public trafficways, and it undercounts long-latency deaths from traumatic brain injury and internal injuries that cross the 30-day threshold.
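The inclusion rule reads naturally as a predicate. The sketch below is illustrative only: the function name and its arguments are hypothetical, and it approximates the window as 30 days measured from the crash timestamp, which may differ at the edges from how state coders apply the rule.

```python
from datetime import datetime, timedelta

FATALITY_WINDOW = timedelta(days=30)  # the threshold described above


def is_fars_case(crash_time, death_time, on_public_trafficway,
                 motor_vehicle_involved=True):
    """Return True if a death would fall inside the FARS definition (sketch)."""
    if not (on_public_trafficway and motor_vehicle_involved):
        return False
    elapsed = death_time - crash_time
    return timedelta(0) <= elapsed <= FATALITY_WINDOW


crash = datetime(2022, 3, 1, 14, 30)
print(is_fars_case(crash, crash + timedelta(days=12), True))   # inside the window
print(is_fars_case(crash, crash + timedelta(days=31), True))   # past the window
print(is_fars_case(crash, crash + timedelta(days=2), False))   # private property
```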

Within those boundaries, FARS is comprehensive. NHTSA estimates that it captures effectively 100 percent of qualifying fatalities. The agency cross-validates annually against state vital statistics death records and Fatality Management Information System (FMIS) data. Discrepancies are investigated; the published dataset reflects the resolved count. FARS is unusual among federal regulatory datasets in that its completeness is not meaningfully disputed — the definitional exclusions are by design, not by measurement failure.

Data scope: 30+ interconnected files per year

Each annual FARS release is not a single file. It is a relational database distributed as a collection of flat files, with each file covering a different analytical unit. As of the most recent releases, a complete year of FARS data comprises more than 30 separate tables. The core files are:

  • ACCIDENT: one row per crash. Contains crash date, time, location (state, county, city, route, GPS coordinates in recent years), road type, manner of collision, first harmful event, weather conditions, light conditions, and summary counts of vehicles, persons, and fatalities involved. The ST_CASE field is the crash identifier; all other files join to it.
  • VEHICLE: one row per motor vehicle involved in a qualifying crash. Contains vehicle make, model, model year, body type, special use designation, travel speed, pre-crash maneuver, first harmful event, vehicle role (striking or struck), and whether the vehicle was involved in a hit-and-run. Joins to ACCIDENT on ST_CASE; vehicle number within the crash is VEH_NO.
  • PERSON: one row per person involved (occupants, pedestrians, cyclists, and other non-motorists). Contains age, sex, injury severity, seating position, restraint use, alcohol test result, drug test result, ejection status, and airbag deployment. Joins to ACCIDENT and VEHICLE on ST_CASE and VEH_NO.
  • CEVENT: crash events in sequence. Each crash can involve multiple harmful events (e.g., vehicle runs off road, then strikes a tree, then rolls over). This file captures the event sequence with an object struck and area-of-impact code for each event.
  • FACTOR: contributing factors coded per vehicle. Includes driver-related factors (inattention, fatigue, improper driving behaviors), vehicle-related factors (tire failure, brake failure, cargo shift), and environmental factors.
  • DISTRACT: driver distraction codes per vehicle and driver. Includes cell phone use, passenger distraction, looking away from road, and other attention-related factors. This file was substantially expanded after 2009 as distracted driving emerged as a policy priority.
  • MANEUVER: pre-crash vehicle maneuvers, coded separately from FACTOR. Captures lane changes, turning movements, and other kinematic actions immediately preceding the crash.
  • PBTYPE: pedestrian and bicyclist type and crash type for non-motorist fatalities. Contains the non-motorist's location relative to the roadway, their action, and the crash type (e.g., backing vehicle, turning vehicle, running red light).
  • RACE: race and ethnicity for persons involved in the crash, added in more recent releases. Coverage is incomplete in earlier years due to variation in how states collected and reported this field.
  • SAFETYEQ: safety equipment use per person, expanding on the restraint use field in PERSON with additional detail on helmet use, child safety seat type, and equipment condition.
  • WEATHER, VISION, PARKWORK, VEHNIT, NMCRASH: additional supporting files covering atmospheric conditions, sight obstructions, parked vehicles involved in crashes, non-in-traffic vehicles, and non-motorist crash typing respectively.

A full analytical schema joins all of these files into a denormalized wide table, but for most questions it is sufficient to work with ACCIDENT, VEHICLE, and PERSON. The join keys are consistent across all files and all years: STATE (FIPS state code), ST_CASE (case number unique within state and year), VEH_NO (vehicle number within case), and PER_NO (person number within vehicle).
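As a minimal sketch of that three-file join, the function below attaches vehicle and crash columns to each PERSON row for a single year. The tiny frames are made-up stand-ins for the real CSVs, and the treatment of non-motorists (VEH_NO equal to 0, so they match no vehicle row) is an assumption to confirm against the user's manual. For multi-year work, a year column would need to be added to the join keys.

```python
import pandas as pd


def join_core_files(accident, vehicle, person):
    """Attach vehicle- and crash-level columns to each PERSON row (one year)."""
    # Left join: persons with no matching vehicle row (assumed to be
    # non-motorists with VEH_NO == 0) are kept with empty vehicle columns.
    per_veh = person.merge(
        vehicle, on=["ST_CASE", "VEH_NO"], how="left",
        suffixes=("", "_veh"), validate="many_to_one",
    )
    return per_veh.merge(
        accident, on="ST_CASE", how="left",
        suffixes=("", "_acc"), validate="many_to_one",
    )


# Toy stand-ins for ACCIDENT.CSV, VEHICLE.CSV, and PERSON.CSV
accident = pd.DataFrame({"ST_CASE": [10001], "LGT_COND": [2]})
vehicle = pd.DataFrame({"ST_CASE": [10001], "VEH_NO": [1], "BODY_TYP": [4]})
person = pd.DataFrame({
    "ST_CASE": [10001, 10001],
    "VEH_NO": [1, 0],       # second row is a non-motorist in this toy example
    "PER_NO": [1, 1],
    "INJ_SEV": [4, 4],
})

wide = join_core_files(accident, vehicle, person)
print(wide)
```

The `validate` arguments make pandas raise an error if a key that should be unique is duplicated, which is a cheap guard against silently multiplying rows.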

Key variables

The variables most relevant for policy and safety analysis cluster across the three core files:

From ACCIDENT: crash date and time (YEAR, MONTH, DAY, HOUR, MINUTE); geographic identifiers (STATE, COUNTY, CITY, LATITUDE, LONGITUD in recent releases); road type (RUR_URB rural vs. urban, FUNC_SYS functional classification, NHS National Highway System designation); manner of collision (MAN_COLL); first harmful event (HARM_EV); and drunk driving flag (DRUNK_DR).

From VEHICLE: make and model (MAKE, MODEL; coded, not free text, with a lookup table published by NHTSA); model year (MOD_YEAR); body type (BODY_TYP); special use (SPEC_USE, which flags emergency vehicles, transit, school bus, farm equipment); travel speed (TRAV_SP); hit-and-run indicator (HIT_RUN); and rollover and damage extent (ROLLOVER, DEFORMED).

From PERSON: age (AGE); sex (SEX); injury severity (INJ_SEV on a five-point scale, with 4 being fatal); seating position (SEAT_POS); restraint use (REST_USE); air bag deployment (AIR_BAG); ejection status (EJECTION); alcohol test result (ALC_RES) and alcohol test status (ALC_STATUS); drug involvement (DRUGSPEC); and person type (PER_TYP distinguishing driver, passenger, pedestrian, cyclist, and other non-motorist).

50-year trends worth analyzing

FARS spans more than half a century, making it one of the longest continuous behavioral and infrastructure safety time series in US federal data. Four macro-trends stand out in a longitudinal analysis:

Drunk driving deaths. In 1982, NHTSA estimates that alcohol-impaired driving was involved in approximately 60 percent of all traffic fatalities. By 2023, that share had fallen to roughly 28 percent, a decline widely attributed to a combination of the national 21-year drinking age (adopted by all states by 1988 under threat of federal highway fund withholding), the per se 0.08 BAC legal limit (federally mandated in 2000), and increased law enforcement through sobriety checkpoint programs. FARS captures this directly: the DRUNK_DR field in ACCIDENT records whether any driver in the crash was alcohol-impaired; the ALC_RES field in PERSON records blood alcohol concentration where tested. The trend is visible in a simple annual groupby. The absolute number of drunk driving deaths remains above 10,000 per year, a number that would be considered a crisis in any other category of preventable death.
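A rough version of that groupby might look like the following. It assumes a multi-year ACCIDENT frame with a YEAR column, a per-crash fatality count named FATALS (an assumption; check the column name in your release), and DRUNK_DR treated as flagged whenever it is non-zero. The rows are toy data, and the result is a raw share of deaths in flagged crashes, not NHTSA's published estimate.

```python
import pandas as pd


def alcohol_share_by_year(accident):
    """Share of fatalities per year that occurred in DRUNK_DR-flagged crashes."""
    flagged = accident.assign(
        # Count a crash's fatalities only when the crash carries the flag.
        alc_fatals=accident["FATALS"].where(accident["DRUNK_DR"] > 0, 0)
    )
    yearly = flagged.groupby("YEAR")[["FATALS", "alc_fatals"]].sum()
    yearly["alc_share"] = yearly["alc_fatals"] / yearly["FATALS"]
    return yearly


# Toy rows, not real FARS counts
accident = pd.DataFrame({
    "YEAR": [1982, 1982, 2020, 2020],
    "FATALS": [1, 2, 1, 3],
    "DRUNK_DR": [1, 0, 0, 1],
})

result = alcohol_share_by_year(accident)
print(result)
```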

Seat belt non-use. The REST_USE field in PERSON documents restraint use for every fatality. In years before mandatory seat belt laws (the first state law was New York's in 1984), the dataset captures a period when seat belt non-use was the norm among fatally injured occupants. The trend across five decades shows a substantial increase in belt use among fatalities, but a persistent tail of unbelted deaths concentrated among adult male drivers, pickup truck occupants, and rural crashes. As of the most recent data, approximately 45 percent of passenger vehicle occupant fatalities involved unbelted persons — in a country with primary seat belt laws in 34 states.

Speeding-related fatalities. NHTSA codes a crash as speeding-related if any driver in it was charged with a speeding offense, or if NHTSA determines that racing, driving too fast for conditions, or exceeding the posted limit was a contributing factor. The SPEEDREL field records this determination per driver in the VEHICLE file; a crash counts as speeding-related when any of its vehicles carries the flag. Speeding-related fatalities have remained stubbornly stable as a share of total fatalities, at approximately 29 percent, despite the general decline in overall traffic deaths between 1972 and 2014. That period saw total fatalities fall from 54,000 to 32,700; speeding's share barely moved.

The pedestrian and cyclist death surge since 2009. This is the most alarming trend in modern FARS data and the one that has attracted the most recent policy attention. After decades of declining pedestrian fatalities, the number reversed in 2009 and has climbed nearly continuously since. By 2022, pedestrian deaths reached 7,508, the highest count since 1981. Several structural factors appear in the data. The shift toward large SUVs and light trucks, which now constitute the majority of new vehicle sales, correlates with higher pedestrian fatality severity when a crash occurs (higher hood height, greater front mass, reduced pedestrian visibility). Smartphone-era distracted driving begins precisely in the 2007–2009 period when the trend reversed. And the LGT_COND field in ACCIDENT shows that an increasing share of pedestrian fatalities occur in dark conditions without street lighting, a finding consistent with reduced infrastructure investment in lower-income areas.
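The lighting pattern can be checked with a short join of PERSON to ACCIDENT. The code values below (PER_TYP 5 for pedestrian, INJ_SEV 4 for fatal, LGT_COND 2 for dark and not lighted) are assumptions to verify against the user's manual for the year in question, and the frames are toy data.

```python
import pandas as pd

# Assumed code values; confirm against the manual for each release year.
PEDESTRIAN = 5
FATAL = 4
DARK_NOT_LIGHTED = 2


def dark_unlit_share(person, accident):
    """Share of pedestrian fatalities that occurred in dark, unlit conditions."""
    peds = person.loc[
        (person["PER_TYP"] == PEDESTRIAN) & (person["INJ_SEV"] == FATAL),
        ["ST_CASE"],
    ]
    # Light condition is a crash-level field, so bring it in from ACCIDENT.
    peds = peds.merge(accident[["ST_CASE", "LGT_COND"]], on="ST_CASE", how="left")
    return (peds["LGT_COND"] == DARK_NOT_LIGHTED).mean()


# Toy rows, not real FARS records
person = pd.DataFrame({
    "ST_CASE": [1, 2, 3, 3],
    "PER_TYP": [5, 5, 5, 1],
    "INJ_SEV": [4, 4, 4, 4],
})
accident = pd.DataFrame({"ST_CASE": [1, 2, 3], "LGT_COND": [2, 3, 2]})

share = dark_unlit_share(person, accident)
print(round(share, 3))
```

Running the same function per year and plotting the result is one way to see whether the share is actually rising in a given set of releases.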

How to get the data

NHTSA distributes FARS data through its FTP server at nhtsa.gov/file-downloads. Each annual directory contains files in two formats: SAS transport format (the legacy format used by NHTSA researchers) and CSV. For most purposes, the CSV files are preferable — each table in the annual release is a separate CSV named after the file type (ACCIDENT.CSV, VEHICLE.CSV, PERSON.CSV, etc.).

The release cadence is annual, typically published 8 to 12 months after the reference year ends. An Annual Report File (ARF) is released first; a “final” file that incorporates late corrections follows, typically about a year later. For most analytical purposes, the final file is the one to use once it is available. NHTSA also publishes an early estimates report in the second quarter of the following year using a preliminary methodology.

For interactive query access without bulk download, NHTSA provides the FARS Query System at cdan.dot.gov/query, which allows web-based pivot queries on key dimensions. This is useful for quick fact-finding but not for programmatic analysis or multi-year trend work.

The companion dataset for non-fatal crashes is the Crash Report Sampling System (CRSS), which replaced the National Automotive Sampling System General Estimates System (NASS GES) in 2016. CRSS is a stratified probability sample of police-reported crashes (not limited to fatalities), designed to produce national estimates of crash frequency, injury severity, and contributing factors across the full severity spectrum. CRSS data is available from the same NHTSA FTP directory and is necessary for any analysis that requires a denominator (e.g., fatality rate per 100 crashes of a given type).

Three research use cases

State-by-state pedestrian fatality rates adjusted for VMT. Raw pedestrian death counts favor high-population states. The analytically meaningful comparison adjusts for vehicle miles traveled (VMT), which FHWA publishes by state in the Highway Statistics series. A simple join of FARS pedestrian fatalities by state and year against FHWA VMT by state and year produces a pedestrian deaths per 100 million VMT metric. That metric surfaces a striking geographic pattern: Florida, New Mexico, South Carolina, and Louisiana consistently rank among the highest-rate states despite varying population size and urbanization. The common factors — warmer climate encouraging year-round walking, higher proportions of older pedestrians, and road designs optimized for vehicle throughput rather than pedestrian crossings — are all legible in FARS fields (AGE, LGT_COND, FUNC_SYS) and geographic aggregations.
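A minimal sketch of that join follows. The column names `ped_deaths` and `vmt_millions` are hypothetical, the VMT input is assumed to be expressed in millions of vehicle miles, and the numbers are made up for illustration rather than taken from FARS or FHWA.

```python
import pandas as pd


def ped_rate_per_100m_vmt(ped_deaths, vmt):
    """Pedestrian deaths per 100 million VMT by state and year."""
    merged = ped_deaths.merge(
        vmt, on=["STATE", "YEAR"], how="inner", validate="one_to_one"
    )
    # vmt_millions / 100 gives VMT in units of 100 million miles.
    merged["rate_per_100m_vmt"] = merged["ped_deaths"] / (merged["vmt_millions"] / 100)
    return merged.sort_values("rate_per_100m_vmt", ascending=False)


# Made-up toy inputs: one row per state-year
ped_deaths = pd.DataFrame({"STATE": [12, 35], "YEAR": [2022, 2022], "ped_deaths": [30, 50]})
vmt = pd.DataFrame({"STATE": [12, 35], "YEAR": [2022, 2022], "vmt_millions": [20000, 10000]})

rates = ped_rate_per_100m_vmt(ped_deaths, vmt)
print(rates)
```

The inner join silently drops state-years missing from either source, so it is worth comparing row counts before and after when running this on real data.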

Teen vs. elderly driver crash patterns. The AGE field in PERSON combined with PER_TYP (driver) allows direct comparison of crash circumstances by age cohort. Teen drivers (16–19) and elderly drivers (75+) both have elevated fatality rates per licensed driver, but the pattern of contributing factors differs sharply. Teen driver fatalities concentrate at night, on weekends, in crashes involving speed and multiple occupants, with a lower rate of seatbelt use and a higher rate of distraction-coded crashes. Elderly driver fatalities concentrate in daylight hours, at intersections, in crashes coded as failure to yield and improper turning, which is consistent with deteriorating reaction time and visual field rather than risk-taking behavior. The FACTOR and MANEUVER files provide the contributing factor coding; PBTYPE provides the crash type for cases where an elderly driver struck a pedestrian.
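One way to start on the cohort comparison is to bin drivers by age and compare a single circumstance, such as the share of crashes at night. The sketch assumes PER_TYP 1 means driver, that an HOUR column is present on the PERSON rows (merge it from ACCIDENT if not), and that ages of 120 or more are unknown-value codes; the night window of 9 p.m. to 5:59 a.m. is an arbitrary illustrative choice, and the rows are toy data.

```python
import pandas as pd

DRIVER = 1                                       # assumed PER_TYP code for drivers
NIGHT_HOURS = [21, 22, 23, 0, 1, 2, 3, 4, 5]     # illustrative definition of "night"


def night_share_by_cohort(person):
    """Share of drivers involved at night, by age cohort."""
    drivers = person[(person["PER_TYP"] == DRIVER) & (person["AGE"] < 120)].copy()
    drivers["cohort"] = pd.cut(
        drivers["AGE"],
        bins=[15, 19, 74, 119],                  # (15,19], (19,74], (74,119]
        labels=["teen_16_19", "adult_20_74", "elderly_75_plus"],
    )
    drivers["night"] = drivers["HOUR"].isin(NIGHT_HOURS)
    return drivers.groupby("cohort", observed=True)["night"].mean()


# Toy rows, not real FARS records
person = pd.DataFrame({
    "PER_TYP": [1, 1, 1, 1, 2],
    "AGE": [17, 18, 80, 78, 17],
    "HOUR": [23, 14, 10, 11, 23],
})

profile = night_share_by_cohort(person)
print(profile)
```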

Large truck vs. passenger vehicle crash severity. The BODY_TYP field distinguishes large trucks (single-unit trucks over 10,000 lbs gross vehicle weight rating, combination trucks, and semi-trailers) from passenger vehicles. A join of VEHICLE to PERSON on ST_CASE and VEH_NO allows analysis of fatality rate by vehicle type for the occupants of each vehicle in a crash. The finding is stark: in large truck/passenger vehicle crashes, roughly 97 percent of fatalities are passenger vehicle occupants or non-motorists. Truck occupants are rarely killed. This severity asymmetry reflects mass incompatibility: a collision between a 40-ton combination truck and a 3,500-pound passenger car concentrates the energy almost entirely in the lighter vehicle. The CRSS companion dataset allows extension of this analysis to non-fatal crashes, producing injury severity distributions rather than just fatality counts.
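The asymmetry can be approximated with the join described above. In this sketch the set of BODY_TYP codes that count as large trucks is passed in by the caller, and the value used in the demo is a placeholder rather than the official list; INJ_SEV 4 is assumed to mean fatal. Note that it covers every crash involving a large truck, including single-vehicle truck crashes, so it is a looser population than truck/passenger-vehicle crashes alone.

```python
import pandas as pd

FATAL = 4  # assumed INJ_SEV code for a fatal injury


def non_truck_fatality_share(vehicle, person, truck_codes):
    """Among deaths in crashes involving a large truck, share outside the truck."""
    veh = vehicle[["ST_CASE", "VEH_NO", "BODY_TYP"]].copy()
    veh["is_truck"] = veh["BODY_TYP"].isin(truck_codes)
    truck_cases = veh.loc[veh["is_truck"], "ST_CASE"].unique()

    dead = person.loc[
        (person["INJ_SEV"] == FATAL) & person["ST_CASE"].isin(truck_cases),
        ["ST_CASE", "VEH_NO"],
    ]
    dead = dead.merge(
        veh[["ST_CASE", "VEH_NO", "is_truck"]], on=["ST_CASE", "VEH_NO"], how="left"
    )
    # Persons with no vehicle row (non-motorists) come back as NaN, i.e. not in a truck.
    in_truck = dead["is_truck"].eq(True)
    return 1.0 - in_truck.mean()


# Toy rows; {66} is a placeholder, not the official large-truck code list
vehicle = pd.DataFrame({
    "ST_CASE": [1, 1, 2],
    "VEH_NO": [1, 2, 1],
    "BODY_TYP": [66, 4, 4],
})
person = pd.DataFrame({
    "ST_CASE": [1, 1, 1, 1, 2],
    "VEH_NO": [2, 1, 0, 1, 1],
    "INJ_SEV": [4, 0, 4, 4, 4],
})

share = non_truck_fatality_share(vehicle, person, truck_codes={66})
print(round(share, 3))
```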

Cross-referencing with NHTSA recalls

NHTSA is responsible for both FARS and the federal vehicle recall system under 49 U.S.C. § 30118. The recall database — covering every safety-related recall since the 1960s — is available via the NHTSA API at api.nhtsa.gov/recalls/recallsByVehicle. The join between FARS and the recall database requires matching on vehicle year, make, and model — exactly the fields present in the VEHICLE file.

The analytical question this join enables: for crashes involving a specific vehicle make, model, and model year, what fraction of those vehicles were subject to an open recall at the time of the crash? An open recall is one where the recall campaign has been announced but the remedy has not been applied to the specific vehicle. NHTSA's Vehicle Identification Number (VIN) decoder, the vPIC API at vpic.nhtsa.dot.gov/api/vehicles/DecodeVinValuesExtended, accepts a VIN and returns make, model, and model year as names; those names can then be passed to the recall endpoint to list the associated campaigns. FARS includes VINs in the VEHICLE file for recent years (coverage improved substantially after 2010), enabling VIN-based matching rather than relying on coded make/model/year values alone.

This cross-reference has investigative value in specific contexts. A crash involving a vehicle with an unrepaired open recall for a steering component failure, where the crash first harmful event is coded as loss of control, is a different kind of event than the same crash without that background. NHTSA publishes its own defect investigation records and preliminary evaluation documents in the NHTSA Safety Issues database, which provides the investigation record that preceded each recall campaign. Connecting FARS crashes to recall campaigns to defect investigation documents creates a chain of evidence that is not assembled in any single public source.

import pandas as pd
import requests

# Load the FARS VEHICLE file for a given year. VIN is read as text so the
# .str accessor works even when some values look numeric.
vehicle = pd.read_csv("VEHICLE.CSV", dtype={"VIN": str}, low_memory=False)
# Keep only rows with a usable VIN (10+ characters, recent years)
vehicle = vehicle[vehicle["VIN"].str.len() >= 10].copy()

# vPIC VIN decoder: returns make, model, and model year as names.
# It does not return recall campaigns; those come from the recall endpoint.
def decode_vin(vin):
    resp = requests.get(
        "https://vpic.nhtsa.dot.gov/api/vehicles/DecodeVinValues/" + vin,
        params={"format": "json"},
        timeout=10,
    )
    resp.raise_for_status()
    data = (resp.json().get("Results") or [{}])[0]
    return {
        "make": data.get("Make", ""),
        "model": data.get("Model", ""),
        "year": data.get("ModelYear", ""),
    }

# Check recalls for a specific make/model/year combination (names, not FARS codes)
def get_recalls_for_vehicle(make, model, year):
    resp = requests.get(
        "https://api.nhtsa.gov/recalls/recallsByVehicle",
        params={"make": make, "model": model, "modelYear": year},
        timeout=10,
    )
    resp.raise_for_status()
    results = resp.json().get("results", [])
    return len(results), [r.get("Component", "") for r in results]

# Example: recall counts and components for the 50 most frequent vehicles
# (filter the components column for brake-related entries as needed).
# First aggregate VEHICLE to make/model/year level, keeping one VIN per group
agg = (
    vehicle.groupby(["MAKE", "MODEL", "MOD_YEAR"])
    .agg(crash_count=("VIN", "size"), sample_vin=("VIN", "first"))
    .reset_index()
    .sort_values("crash_count", ascending=False)
    .head(50)
)

# FARS MAKE and MODEL are numeric codes, while the recall endpoint expects
# names, so decode the group's sample VIN to recover them before the lookup.
# (In production, cache API responses to avoid rate limiting)
def recalls_for_group(row):
    decoded = decode_vin(row["sample_vin"])
    return pd.Series(
        get_recalls_for_vehicle(decoded["make"], decoded["model"], decoded["year"])
    )

# Then check recalls for each group
agg[["recall_count", "components"]] = agg.apply(recalls_for_group, axis=1)

Limitations and researcher notes

Public road only. FARS records crashes on public trafficways. A collision in a private parking lot, on a private farm road, or on a gated community street is not a FARS case regardless of severity. This exclusion is substantial: parking lot crashes are common and occasionally fatal; private road crashes in rural areas are underreported in general and entirely excluded from FARS by definition.

The 30-day fatality threshold. A person who dies 36 days after a crash from injuries sustained in it is not counted in FARS. The clinical threshold is relatively permissive compared with, say, the 28-day window used in some international road safety databases, but it still creates a systematic undercount of long-latency deaths from traumatic brain injury, spinal cord injury, and internal organ damage. NHTSA estimates that the 30-day threshold excludes between 2 and 5 percent of deaths that result from crash injuries.

Non-crash road deaths. A pedestrian who suffers a fatal cardiac event while crossing a street is not a FARS case unless a vehicle was involved in a crash. Medical emergencies that cause a driver to lose control and result in a single-vehicle crash are FARS cases; medical emergencies that result in a driver pulling to the side of the road and dying there are not. This boundary is clear in the regulations but creates ambiguity in cases where the medical event's relationship to the crash sequence is uncertain.

Alcohol and drug test gaps. The ALC_RES field records blood alcohol concentration when a test was performed. Test performance is not universal: it depends on state law, law enforcement procedures, and whether the driver survived the crash. Drivers who die at the scene are more consistently tested than drivers who survive; pedestrians who die are more consistently tested than those who survive. The result is a selection bias in the alcohol and drug data: the tested population is not random, and high BAC values are overrepresented among deceased drivers relative to the true at-fault driving population.

Coded variable lookup tables. Most categorical variables in FARS are coded as integers, not strings. The MAKE field is a numeric code, not “Ford” or “Toyota.” NHTSA publishes a FARS Analytical User's Manual and associated lookup tables (the “SAS formats” file) that map every code to its label. These must be joined to the raw files before any label-based analysis. The manual is updated with each annual release; code values can be added or reassigned across years, so version-matched lookups are essential for multi-year work.
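A version-matched lookup can be handled by keying the label table on year as well as code. The layout below (a long table with `YEAR`, `code`, and `label` columns) is a hypothetical format you would build yourself from the manual or the formats file, and the rows are illustrative only.

```python
import pandas as pd


def attach_labels(df, lookup, code_col, label_col):
    """Add a label column for code_col using a year-specific lookup table."""
    keyed = lookup.rename(columns={"code": code_col, "label": label_col})
    return df.merge(
        keyed[["YEAR", code_col, label_col]],
        on=["YEAR", code_col],
        how="left",                # unmatched codes stay visible as missing labels
        validate="many_to_one",    # fails loudly if a code is duplicated in a year
    )


# Illustrative rows; build the real table from the release-specific documentation
vehicle = pd.DataFrame({"YEAR": [2015, 2022], "MAKE": [12, 49]})
make_lookup = pd.DataFrame({
    "YEAR": [2015, 2015, 2022, 2022],
    "code": [12, 49, 12, 49],
    "label": ["Ford", "Toyota", "Ford", "Toyota"],
})

labeled = attach_labels(vehicle, make_lookup, "MAKE", "MAKE_LABEL")
print(labeled)
```

Keeping the join as a left merge means a code with no label for its year shows up as a missing value instead of disappearing, which makes reassigned or newly added codes easy to spot.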

Cross-references and companion datasets

FARS exists within a broader NHTSA and federal transportation data ecosystem. Three cross-references are particularly productive:

NHTSA recalls and defect complaints. As described above, the recall API and the NHTSA Safety Issues database enable vehicle-level linkage between crash records and recall campaigns. The defect complaint database (accessible at api.nhtsa.gov/complaints/complaintsByVehicle) provides the pre-recall signal: consumer complaints about vehicle behavior that may precede a formal recall investigation. Joining complaint accumulation curves against FARS crash dates for a specific vehicle can surface cases where complaints were filed but no recall followed.

NTSB accident investigations. For crashes that rise to the level of NTSB investigation — typically highway crashes involving multiple fatalities, commercial vehicles, or systemic safety questions — NTSB publishes detailed accident reports that go far beyond the FARS record. NTSB's accident database is searchable at ntsb.gov and covers aviation, rail, marine, and highway modes. Highway accident reports include vehicle inspection findings, driver records, road geometry analysis, and reconstruction diagrams that provide context unavailable in FARS.

EPA vehicle fuel economy data. The EPA's fuel economy dataset (available at fueleconomy.gov/feg/download.shtml) covers every vehicle make, model, model year, and trim since 1984, with EPA city and highway MPG ratings, engine displacement, and vehicle class. Joining EPA vehicle class to FARS vehicle make/model/year enables analysis of fatality patterns by vehicle class — distinguishing, for example, between midsize SUVs and full-size pickup trucks within the broad “light truck” category that FARS body type codes sometimes collapse together.
