OSHA 300A Injury and Illness Data: The Federal Database Behind Establishment-Level Workplace Injury Rates

· AI Analytics
OSHA · Workplace Injuries · Safety · Labor · Federal Data

OSHA requires establishments with 20+ employees in high-hazard industries to submit annual injury and illness summary data — the OSHA 300A form — creating a public database of injury rates, Days Away from Work cases, and illness incidence rates by employer and location across 300,000+ establishments.

What the OSHA 300A Is

OSHA Form 300A is the Annual Summary of Work-Related Injuries and Illnesses. It is the establishment-level aggregate of every recordable injury and illness event that occurred at a worksite during a calendar year. Unlike the underlying Form 300 (the Log of Work-Related Injuries and Illnesses, which records each incident individually) or Form 301 (the Injury and Illness Incident Report, which captures narrative detail about each event), the 300A is a summary document: total counts of cases by type, total hours worked, and average employment. Form 300 and Form 301 are primarily internal records that employers must retain for five years, and most establishments never submit them to OSHA. The 300A is the one form in the suite that becomes a public federal dataset.

Every covered establishment must submit 300A data electronically via OSHA's Injury Tracking Application (ITA) portal by March 2 for the prior calendar year. “Covered” means establishments with 20 or more employees that fall within a high-hazard NAICS code list maintained in OSHA's regulations (Appendix A to Subpart E of 29 CFR Part 1904). The high-hazard list includes construction, manufacturing, agriculture, transportation, warehousing, healthcare, and several service sectors, all industries where injury rates have historically exceeded the all-industry average. Establishments with 250 or more employees must also submit the 300A regardless of NAICS code, provided they are in an industry that is required to keep OSHA injury and illness records.
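The coverage test can be read as a small decision function. The following is a minimal sketch, not the regulatory text: `HIGH_HAZARD_PREFIXES` is a hypothetical stand-in for the Appendix A list (the real list is longer and more specific), and the sketch assumes every 250+ employee establishment is in an industry that keeps OSHA records.

```python
# Illustrative stand-in for the Appendix A high-hazard list.
# These prefixes are assumptions for the example, NOT the real list.
HIGH_HAZARD_PREFIXES = ("23", "31", "32", "33", "493", "622", "623")


def must_submit_300a(employees: int, naics_code: str) -> bool:
    """Return True if the establishment would owe an electronic 300A filing.

    Assumes establishments with 250+ employees file regardless of NAICS,
    and establishments with 20-249 employees file only in high-hazard codes.
    """
    if employees >= 250:
        return True
    if employees >= 20:
        return naics_code.startswith(HIGH_HAZARD_PREFIXES)
    return False


print(must_submit_300a(35, "236220"))   # construction contractor, 35 employees
print(must_submit_300a(35, "541110"))   # law office, 35 employees
```

Swapping in the actual Appendix A codes is the only change needed to use this as a first-pass filter on an employer list.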

The resulting public dataset at data.osha.gov contains more than 300,000 establishment records per annual filing cycle, covering a broad cross-section of American industry, from individual warehouse facilities and meat processing plants to hospital campuses and construction contractors. Annual CSV files covering submissions from 2016 to the present are available for download. The dataset is also accessible via a Socrata API at https://data.osha.gov/api/views/62je-3e4m/rows.json. Annual data for a given calendar year is typically released between September and November of the following year, after OSHA completes its data processing and quality review cycle.

What 300A Data Contains

The 300A record for each establishment contains three categories of fields: establishment identifiers, employment and hours metrics, and injury and illness case counts broken down by type.

Establishment identifiers include estab_name, street_address, city, state, zip, naics_code, and estab_type. The establishment name field is the employer's own reported name for the facility, which means that a large employer with many sites will appear as dozens or hundreds of separate records, each representing a distinct physical location. This structure makes the dataset suitable for establishment-level comparisons within and across industries, and for identifying employer-wide patterns by aggregating on name or a corporate parent identifier derived from external sources.
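Because the name field is free text, aggregating by employer usually needs some normalization first. A minimal sketch, assuming records are plain dictionaries keyed by the field names above; the `SUFFIXES` set and the `normalize_name` helper are illustrative choices, not part of the dataset:

```python
import re
from collections import defaultdict

# Hypothetical list of corporate suffixes to ignore when matching names.
SUFFIXES = {"inc", "llc", "corp", "co", "company", "ltd"}


def normalize_name(name: str) -> str:
    """Lowercase, strip punctuation, and drop common corporate suffixes."""
    tokens = re.sub(r"[^a-z0-9 ]", " ", name.lower()).split()
    return " ".join(t for t in tokens if t not in SUFFIXES)


def group_by_employer(records):
    """Group establishment records under a normalized employer name."""
    groups = defaultdict(list)
    for rec in records:
        groups[normalize_name(rec["estab_name"])].append(rec)
    return dict(groups)


sample = [
    {"estab_name": "Acme Foods, Inc.", "city": "Omaha"},
    {"estab_name": "ACME FOODS INC", "city": "Tulsa"},
    {"estab_name": "Beta Logistics LLC", "city": "Reno"},
]
grouped = group_by_employer(sample)
for employer, sites in grouped.items():
    print(employer, len(sites))
```

String normalization of this kind only catches spelling variants. Subsidiaries that report under unrelated names still need an external parent-company mapping.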

The employment and hours metrics are annual_average_employees (the average number of employees working at the establishment during the year, used as the denominator in incidence rate calculations) and total_hours_worked (the aggregate hours worked by all employees at the establishment for the entire year, the preferred denominator for rate calculations that need to account for part-time and seasonal workforces).

The case count fields fall into two groups. The outcome fields classify every recordable case, injury or illness, by its most serious result: total_deaths (work-related fatalities), total_dafw_cases (Days Away From Work cases, the most serious nonfatal category, involving cases severe enough to keep a worker off the job beyond the day of the incident), total_djtr_cases (Days of Job Transfer or Restriction only, cases that required modified duty but not time off), and total_other_cases (recordable cases not involving days away or job transfer). The type fields split the same cases by kind: total_injury_cases (all recordable injuries) on the injury side, and total_skin_disorder_cases, total_respiratory_condition_cases, total_poisoning_cases, total_hearing_loss_cases, and total_other_illness_cases on the illness side. The outcome totals and the type totals should sum to the same number of recordable cases.

A recordable case under OSHA's Part 1904 regulations is any work-related injury or illness that results in death, days away from work, restricted work activity, job transfer, medical treatment beyond first aid, loss of consciousness, or diagnosis of a significant injury or illness by a licensed healthcare professional. The threshold is deliberately broader than “hospitalized” or “serious” — a worker who requires prescription medication, physical therapy, or a work restriction due to an on-the-job injury generates a recordable case even if the injury appears minor by clinical standards.

Computing Injury Rates

The 300A data enables two standard industry metrics that OSHA and the Bureau of Labor Statistics use to compare injury experience across employers and industries.

The Total Recordable Case (TRC) rate is computed as (N / EH) × 200,000, where N is the number of recordable cases, EH is the total employee-hours worked, and 200,000 is the base representing 100 full-time workers at 40 hours per week for 50 weeks per year. Expressing the rate per 200,000 hours normalizes across establishments of different sizes and workforce compositions, making a 50-employee manufacturer directly comparable to a 5,000-employee distribution center.

The DART rate (Days Away from work, Restricted duty, and job Transfer) measures only the more serious injury events: ((total_dafw_cases + total_djtr_cases) / total_hours_worked) × 200,000. The DART rate excludes other recordable cases that required medical treatment but no time away from regular duties, making it a closer proxy for injuries with meaningful productivity and human cost.
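As a worked example of both formulas, the sketch below computes the two rates for a hypothetical establishment. The case counts and hours are invented for illustration, and the TRC numerator sums all four outcome categories.

```python
BASE_HOURS = 200_000  # 100 full-time workers x 40 hours x 50 weeks


def incidence_rate(cases: int, hours_worked: float) -> float:
    """Cases per 200,000 employee-hours."""
    if hours_worked <= 0:
        raise ValueError("hours_worked must be positive")
    return cases / hours_worked * BASE_HOURS


# Hypothetical establishment: 250,000 hours worked in the year.
deaths, dafw, djtr, other = 0, 4, 3, 5
hours = 250_000

trc = incidence_rate(deaths + dafw + djtr + other, hours)
dart = incidence_rate(dafw + djtr, hours)

print(f"TRC rate:  {trc:.2f}")   # 12 cases -> 9.60
print(f"DART rate: {dart:.2f}")  # 7 cases  -> 5.60
```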

Industry benchmark rates from the Bureau of Labor Statistics Survey of Occupational Injuries and Illnesses (SOII) for 2022 provide context for interpreting 300A establishment rates. All private industry averaged a TRC rate of 2.7. Healthcare and social assistance averaged 5.5, driven by nursing home and hospital worker injuries from patient handling, workplace violence, and needlestick exposures. Warehousing and storage averaged 5.9, with Amazon fulfillment centers documented at roughly double the warehousing average in multiple investigative analyses. Transportation and warehousing broadly averaged 5.1. Construction averaged approximately 3.0 and manufacturing 3.5. Agriculture and food processing, particularly meat and poultry processing, run among the highest rates in any industry category, historically in the 6.0–8.0 range or higher depending on the specific NAICS code and year.

Employer Naming Controversies and High-Profile Cases

The 300A dataset became a significant tool for investigative journalism and academic research on employer safety performance beginning around 2017, when electronic submission requirements expanded coverage and made year-over-year comparison feasible at scale.

Amazon warehouse injury rates attracted sustained scrutiny from The Washington Post, The New York Times, and independent researchers using 300A data. Analyses showed Amazon fulfillment center DAFW rates and TRC rates running at roughly twice the warehousing and storage industry average across multiple years. Amazon disputed the methodology, arguing that its establishment reporting practices and worker demographics made direct comparisons misleading — for example, that higher reporting rates at some facilities reflected better incident capture rather than more injuries, and that the workforce composition (including a higher proportion of newer workers, who have higher injury rates across all industries) skewed the numbers. The dispute illustrates a fundamental interpretive challenge in 300A analysis: the dataset captures reported injuries, not actual injuries, and an employer with aggressive injury reporting practices will appear worse than an employer that systematically suppresses recordable case classification.

Meat and poultry processing consistently appears at the top of 300A injury rate rankings. NAICS codes 3116 (animal slaughtering and processing) and 3117 (seafood product preparation) run TRC rates well above the manufacturing average, driven by repetitive motion injuries from high-speed line work, cuts and lacerations from knife and equipment handling, and the physical demands of cold-room work. The COVID-19 pandemic created an additional exposure challenge in 2020 — meat processing facilities became early outbreak sites due to crowded working conditions, poor ventilation, and cold temperatures that extended viral persistence, and illness cases from that period appear in the 300A illness category fields for those establishments.

Healthcare workers, particularly nursing home and long-term care facility employees, had the highest DART rates in the healthcare sector during the COVID-19 pandemic years visible in the 300A data. The combination of patient handling injuries (musculoskeletal disorders from lifting and transferring residents) and COVID-19 respiratory illness cases drove nursing home injury rates to extraordinary levels in 2020–2021. These establishments are identifiable in the dataset by NAICS 6231 (nursing care facilities) combined with elevated total_respiratory_condition_cases counts.

Separate from 300A annual reporting, OSHA maintains a Severe Injury Reporting (SIR) requirement under 29 CFR 1904.39: employers must notify OSHA within 24 hours of learning of a work-related amputation, in-patient hospitalization, or loss of an eye, and within 8 hours of learning of a fatality. The SIR database is a distinct dataset from the 300A and captures the most acute injury events. It is searchable at osha.gov and is often the first public signal of a severe incident before any OSHA inspection is completed.
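The two reporting windows can be expressed as a lookup keyed by event type. This is a minimal sketch: the event-type labels and the `report_deadline` helper are hypothetical names chosen for the example, and the clock is assumed to start when the employer learns of the event.

```python
from datetime import datetime, timedelta

# Reporting windows as described above; the keys are illustrative labels.
REPORTING_WINDOWS = {
    "fatality": timedelta(hours=8),
    "inpatient_hospitalization": timedelta(hours=24),
    "amputation": timedelta(hours=24),
    "eye_loss": timedelta(hours=24),
}


def report_deadline(event_type: str, employer_learned_at: datetime) -> datetime:
    """Latest time the employer can notify OSHA for the given event type."""
    return employer_learned_at + REPORTING_WINDOWS[event_type]


learned = datetime(2024, 3, 5, 14, 30)
print(report_deadline("fatality", learned))     # 2024-03-05 22:30:00
print(report_deadline("amputation", learned))   # 2024-03-06 14:30:00
```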

Electronic Submission Requirement History

The existence of the 300A as a public federal dataset is the product of a regulatory history spanning three administrations and a series of contested rulemakings about the appropriate scope of electronic recordkeeping submission.

The Obama administration's 2016 rule (effective January 1, 2017) required electronic submission of all three recordkeeping forms (the 300A summary, the Form 300 injury log, and the Form 301 incident reports) for establishments with 250 or more employees, and of the 300A summary alone for establishments with 20 to 249 employees in high-hazard NAICS codes. The 300 and 301 submission requirements were controversial because they would have made individual incident narratives and worker-specific injury descriptions publicly available, raising privacy concerns about identifiable medical information appearing in a federal database.

The Trump administration's 2019 rule eliminated the Form 300 and Form 301 electronic submission requirements while retaining the 300A submission. This produced the current baseline: the annual 300A summary is public, but the underlying incident-level log and individual incident reports remain internal documents retained at the establishment. The public dataset therefore captures aggregate counts without any incident-level detail.

The Biden administration's 2023 rule (effective January 1, 2024) partially reinstated electronic submission of case-level data, but with a narrower scope than the 2016 rule. The 2023 rule requires electronic submission of Form 300 and Form 301 information only for establishments with 100 or more employees in a subset of high-hazard NAICS codes listed in Appendix B to Subpart E of Part 1904, not for all 20+ employee high-hazard establishments. The 300 Log data for covered establishments is being collected by OSHA but as of 2024 is not publicly released in establishment-identifiable form; OSHA has indicated it intends to publish aggregate analyses rather than raw incident logs.

The practical result for data analysts: the 300A public dataset (2016–present, available at data.osha.gov) remains the only establishment-specific injury data source in the federal public record. Incident-level data requires a FOIA request to OSHA for the 300 Logs, which OSHA has produced in response to press and research requests, with individual-identifying fields redacted.

Data Access

The OSHA Establishment-Specific Injury and Illness Data is published at https://www.osha.gov/Establishment-Specific-Injury-and-Illness-Data. Annual CSV files by year are available for download without registration. Each file covers one calendar year of 300A submissions and is typically 50–100 MB uncompressed. Years 2016 through the most recently completed filing cycle are available; the 2024 filing year data (for calendar year 2023 injuries) became available in late 2024.

The Socrata API endpoint at https://data.osha.gov/api/views/62je-3e4m/rows.json returns the dataset as JSON. Filtered queries use Socrata Query Language (SoQL), which Socrata portals conventionally serve from the dataset's /resource/ path (here /resource/62je-3e4m.json) rather than from the bulk rows export. Useful filters include state (?$where=state='TX'), NAICS prefix (?$where=naics_code like '23%' for all construction establishments), or a combination of NAICS and employee threshold (?$where=naics_code like '33%' AND annual_average_employees > 500 for large manufacturers). The CSV export endpoint (rows.csv?accessType=DOWNLOAD) returns the full dataset in a format suitable for direct loading into pandas, and public datasets do not require an API key.
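SoQL filters contain spaces, quotes, and percent signs, so they need URL encoding before being sent. The sketch below builds an encoded query URL with the standard library. It assumes the dataset is served from Socrata's conventional /resource/ path for the ID cited above, which should be verified against the live portal, and `build_query` is a hypothetical helper.

```python
from urllib.parse import urlencode

# Assumption: conventional Socrata SODA path for the dataset ID in the text.
RESOURCE_URL = "https://data.osha.gov/resource/62je-3e4m.json"


def build_query(where: str, limit: int = 1000) -> str:
    """Return a URL with the SoQL $where and $limit parameters encoded."""
    params = {"$where": where, "$limit": limit}
    return f"{RESOURCE_URL}?{urlencode(params)}"


url = build_query("naics_code like '33%' AND annual_average_employees > 500")
print(url)
```

The resulting string can be passed directly to `requests.get` or `pandas.read_json`. Numeric comparisons such as `> 500` only work if the portal stores that column as a number rather than text.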

A note on data quality: the 300A dataset contains self-reported figures that are not audited at submission time. OSHA conducts compliance audits at a sample of establishments each year and compares submitted 300A totals against the underlying 300 Log entries, citing recordkeeping violations when the counts do not match. But the vast majority of submissions are never verified. Analysts should treat establishments with implausibly low injury rates in high-hazard industries — particularly those with large employee counts and zero DAFW cases over multiple consecutive years — as candidates for underreporting rather than evidence of exceptional safety performance.
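The screening heuristic in the note above can be sketched as follows. The thresholds (`min_employees`, `min_years`), the `year` field, and the use of name plus ZIP code as an establishment key are all assumptions for illustration. The input is assumed to be already filtered to high-hazard NAICS codes.

```python
from collections import defaultdict


def longest_run(years):
    """Length of the longest run of consecutive years."""
    best = run = 0
    prev = None
    for year in sorted(set(years)):
        run = run + 1 if prev is not None and year == prev + 1 else 1
        best = max(best, run)
        prev = year
    return best


def underreporting_candidates(records, min_employees=250, min_years=3):
    """Large establishments with zero DAFW cases in min_years consecutive years."""
    zero_dafw_years = defaultdict(set)
    for rec in records:
        if rec["annual_average_employees"] >= min_employees and rec["total_dafw_cases"] == 0:
            zero_dafw_years[(rec["estab_name"], rec["zip"])].add(rec["year"])
    return sorted(k for k, yrs in zero_dafw_years.items() if longest_run(yrs) >= min_years)


sample = [
    {"estab_name": "Plant A", "zip": "68101", "year": y,
     "annual_average_employees": 600, "total_dafw_cases": 0}
    for y in (2019, 2020, 2021)
] + [
    {"estab_name": "Plant B", "zip": "74101", "year": y,
     "annual_average_employees": 600, "total_dafw_cases": d}
    for y, d in ((2019, 0), (2020, 4), (2021, 0))
]
print(underreporting_candidates(sample))  # [('Plant A', '68101')]
```

A hit from this screen is a prompt for closer review, not a finding: some large establishments do have low-hazard operations despite their NAICS code.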

Python: Querying the OSHA 300A Dataset

The following script downloads the OSHA 300A ITA dataset from the data.osha.gov Socrata API and computes five analytical outputs: the DART rate distribution by 2-digit NAICS, the top 20 establishments by total Days Away From Work cases, states with the highest average TRC rates, the trend in industry total recordable case rates from 2016 through 2022, and a list of any establishment reporting fatalities in the most recent data year.

import requests
import pandas as pd
import io

# ---------------------------------------------------------------------------
# OSHA 300A Establishment-Specific Injury and Illness Data Analysis
# Socrata endpoint: https://data.osha.gov/api/views/62je-3e4m/rows.json
#
# Key fields:
#   estab_name              - establishment (employer) name
#   street_address, city, state, zip
#   naics_code              - NAICS industry code
#   annual_average_employees - average employee count during the year
#   total_hours_worked      - aggregate hours for all employees
#   total_deaths            - work-related fatalities
#   total_dafw_cases        - Days Away From Work cases
#   total_djtr_cases        - Days of Job Transfer or Restriction only
#   total_other_cases       - recordable cases without days away or restriction
#   total_injury_cases      - total recordable injuries
#   total_skin_disorder_cases, total_respiratory_condition_cases,
#   total_poisoning_cases, total_hearing_loss_cases, total_other_illness_cases
# ---------------------------------------------------------------------------

ITA_CSV = (
    "https://data.osha.gov/api/views/62je-3e4m/rows.csv?accessType=DOWNLOAD"
)

print("Downloading OSHA 300A ITA dataset...")
# The full body is read into memory below, so streaming is not needed
resp = requests.get(ITA_CSV, timeout=600)
resp.raise_for_status()

df = pd.read_csv(
    io.BytesIO(resp.content),
    dtype=str,
    low_memory=False,
)
df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")

# Parse numeric fields
numeric_cols = [
    "annual_average_employees",
    "total_hours_worked",
    "total_deaths",
    "total_dafw_cases",
    "total_djtr_cases",
    "total_other_cases",
    "total_injury_cases",
    "total_skin_disorder_cases",
    "total_respiratory_condition_cases",
    "total_poisoning_cases",
    "total_hearing_loss_cases",
    "total_other_illness_cases",
]
for col in numeric_cols:
    if col in df.columns:
        df[col] = pd.to_numeric(df[col], errors="coerce")

if "naics_code" in df.columns:
    # Use the .str accessor so missing codes stay NaN instead of becoming "na"
    df["naics2"] = df["naics_code"].str.strip().str[:2]

# Drop rows with zero or missing hours (cannot compute rates)
df_valid = df[
    df["total_hours_worked"].notna() & (df["total_hours_worked"] > 0)
].copy()

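# Total Recordable Case (TRC) rate = (N / EH) * 200,000
# DART rate = ((DAFW + DJTR) / EH) * 200,000
BASE = 200_000

# N counts every recordable case (injuries and illnesses), so sum the four
# outcome columns rather than using the injury-only total
df_valid["total_recordable_cases"] = df_valid[
    ["total_deaths", "total_dafw_cases", "total_djtr_cases", "total_other_cases"]
].sum(axis=1, min_count=1)
df_valid["trc_rate"] = (
    df_valid["total_recordable_cases"] / df_valid["total_hours_worked"] * BASE
)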
df_valid["dart_rate"] = (
    (df_valid["total_dafw_cases"] + df_valid["total_djtr_cases"])
    / df_valid["total_hours_worked"]
    * BASE
)

# ---------------------------------------------------------------------------
# (a) DART rate distribution by 2-digit NAICS
# ---------------------------------------------------------------------------
if "naics2" in df_valid.columns:
    dart_by_naics = (
        df_valid[df_valid["dart_rate"].notna() & (df_valid["dart_rate"] < 100)]
        .groupby("naics2")
        .agg(
            estab_count=("dart_rate", "count"),
            median_dart=("dart_rate", "median"),
            mean_dart=("dart_rate", "mean"),
        )
        .reset_index()
        .sort_values("median_dart", ascending=False)
        .head(20)
    )
    dart_by_naics["median_dart"] = dart_by_naics["median_dart"].round(2)
    dart_by_naics["mean_dart"] = dart_by_naics["mean_dart"].round(2)
    print("\n=== DART Rate by 2-Digit NAICS (Top 20 by Median DART Rate) ===")
    print(dart_by_naics.to_string(index=False))

# ---------------------------------------------------------------------------
# (b) Top 20 establishments by total DAFW cases
# ---------------------------------------------------------------------------
if "total_dafw_cases" in df_valid.columns:
    top_dafw = (
        df_valid[df_valid["total_dafw_cases"].notna()]
        .nlargest(20, "total_dafw_cases")[
            [
                "estab_name",
                "city",
                "state",
                "naics_code",
                "annual_average_employees",
                "total_dafw_cases",
            ]
        ]
        .reset_index(drop=True)
    )
    print("\n=== Top 20 Establishments by Total Days Away From Work Cases ===")
    print(top_dafw.to_string(index=False))

# ---------------------------------------------------------------------------
# (c) States with highest average TRC rate
# ---------------------------------------------------------------------------
if "state" in df_valid.columns and "trc_rate" in df_valid.columns:
    state_trc = (
        df_valid[df_valid["trc_rate"].notna() & (df_valid["trc_rate"] < 100)]
        .groupby("state")
        .agg(
            estab_count=("trc_rate", "count"),
            mean_trc=("trc_rate", "mean"),
            median_trc=("trc_rate", "median"),
        )
        .reset_index()
        .sort_values("mean_trc", ascending=False)
        .head(15)
    )
    state_trc["mean_trc"] = state_trc["mean_trc"].round(2)
    state_trc["median_trc"] = state_trc["median_trc"].round(2)
    print("\n=== States with Highest Average TRC Rate (Top 15) ===")
    print(state_trc.to_string(index=False))

# ---------------------------------------------------------------------------
# (d) Trend in industry total recordable case rate 2016-2022
#     Requires a submission_year or year column; fall back to grouping
#     by whatever year field is available
# ---------------------------------------------------------------------------
year_col = None
for candidate in ["year", "submission_year", "reporting_year"]:
    if candidate in df_valid.columns:
        year_col = candidate
        break

if year_col:
    df_valid[year_col] = pd.to_numeric(df_valid[year_col], errors="coerce")
    trend = (
        df_valid[
            df_valid[year_col].between(2016, 2022)
            & df_valid["trc_rate"].notna()
            & (df_valid["trc_rate"] < 100)
        ]
        .groupby(year_col)
        .agg(mean_trc=("trc_rate", "mean"))
        .reset_index()
        .sort_values(year_col)
    )
    trend["mean_trc"] = trend["mean_trc"].round(2)
    print("\n=== Industry TRC Rate Trend 2016-2022 (All Establishments) ===")
    for _, row in trend.iterrows():
        bar = "#" * int(row["mean_trc"] * 5)
        print(f"  {int(row[year_col])}  {row['mean_trc']:>6.2f}  {bar}")
else:
    print("\nNo year column found; skipping trend analysis.")

# ---------------------------------------------------------------------------
# (e) Establishments with fatalities in the most recent data year
# ---------------------------------------------------------------------------
if "total_deaths" in df_valid.columns:
    fatal_rows = df_valid[df_valid["total_deaths"] > 0].copy()
    if year_col and year_col in fatal_rows.columns:
        # Most recent year in the whole dataset, not just among fatal rows
        most_recent = df_valid[year_col].max()
        fatal_rows = fatal_rows[fatal_rows[year_col] == most_recent]

    fatal_display = fatal_rows[
        [
            col for col in [
                year_col, "estab_name", "city", "state",
                "naics_code", "total_deaths", "total_dafw_cases",
            ]
            if col and col in fatal_rows.columns
        ]
    ].sort_values("total_deaths", ascending=False)

    print(f"\n=== Establishments Reporting Fatalities ({len(fatal_display)} records) ===")
    print(fatal_display.head(30).to_string(index=False))

The 200,000-hour base in the rate calculations corresponds to 100 full-time equivalent workers at 40 hours per week for 50 weeks. This normalization is standard across OSHA, BLS, and industry benchmarking, so any rate figure from another source using the same base is directly comparable to rates derived from 300A data. Rates above 10.0 for an individual establishment warrant scrutiny: they may reflect a genuinely hazardous operation with accurate reporting, a small establishment with high variance due to low hours (at a 20-employee facility working 40,000 hours, each recordable case adds 5.0 to the rate, so three or four injuries produce an apparent rate of 15 or 20), or an arithmetic error in the submitted hours field.

To extend the fatality analysis: filter the full dataset to total_deaths > 0 across all available years, group by NAICS and state, and compute the fraction of establishments that reported at least one fatality in any given year. Cross-reference with the OSHA Severe Injury Reporting database and the BLS Census of Fatal Occupational Injuries (CFOI) for the same period to triangulate completeness. The 300A fatality counts should track closely with BLS CFOI by industry, and material discrepancies by sector may indicate reporting gaps worth investigating.
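The grouping step of that extension might look like the following minimal sketch. It treats each record as one establishment-year, groups on the 2-digit NAICS prefix and state, and uses invented sample rows. `fatality_share` is a hypothetical helper name.

```python
from collections import defaultdict


def fatality_share(records):
    """Fraction of establishment-year records reporting at least one death,
    keyed by (2-digit NAICS, state)."""
    totals = defaultdict(int)
    fatal = defaultdict(int)
    for rec in records:
        key = (rec["naics_code"][:2], rec["state"])
        totals[key] += 1
        if rec["total_deaths"] > 0:
            fatal[key] += 1
    return {key: fatal[key] / totals[key] for key in totals}


sample = [
    {"naics_code": "311611", "state": "NE", "total_deaths": 1},
    {"naics_code": "311615", "state": "NE", "total_deaths": 0},
    {"naics_code": "236220", "state": "TX", "total_deaths": 0},
]
shares = fatality_share(sample)
for (naics2, state), share in sorted(shares.items()):
    print(naics2, state, f"{share:.2f}")
```

Groups with few records will produce unstable fractions, so a minimum record count per group is worth adding before comparing sectors.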
