No single OSHA file tells you whether a workplace is dangerous. The inspection that brought a compliance officer through the door lives in one dataset; the citations that visit produced live in another; the amputation an employer had to phone in within a day lives in a third; and the plant's own annual tally of injuries and illnesses lives in a fourth. Assemble them and a single establishment's safety history finally comes into one view—and you can ask the question the separate files cannot: after OSHA inspects and cites a workplace, do its workers actually get hurt less often?
This article is a guide to building that view. It covers how OSHA actually works—why it reaches only a fraction of workplaces each year and how a workplace gets selected for inspection; the four datasets in turn (the inspections, the violations and citations, the severe-injury reports, and the Form 300A establishment summaries) and what each row means; the two join keys—the OSHA inspection (activity) number that ties citations to their parent inspection, and the employer establishment that ties inspections, injuries, and 300A summaries together; the establishment-matching problem that is the real work, because those employer identifiers are not standardized across forms; the analytic questions the assembled data answers, above all whether enforcement lowers injury rates afterward; a Python workflow that pulls an establishment's inspections and citations, attaches its severe-injury reports and 300A rates, and compares its injury rate before and after an inspection; and the caveats—thin coverage, self-reporting, the State Plan gap, and the matching error—that every analyst must internalize before drawing a conclusion.
How OSHA actually works
The Occupational Safety and Health Administration, created by the Occupational Safety and Health Act of 1970 and housed in the Department of Labor, is responsible for the safety of roughly 130 million workers across millions of workplaces, and it does that job with on the order of two thousand compliance officers across the federal agency and the state programs combined. The arithmetic is unforgiving: OSHA inspects only a small fraction of workplaces in any given year, and if inspections were scheduled purely at random, a typical workplace could expect a visit about once a century, or less often. Because the agency cannot be everywhere, it does not try to be. Its entire enforcement model is built around targeting: directing scarce inspection resources at the workplaces where harm is most likely or has already occurred. Understanding that selection logic is the precondition for reading any of the four datasets honestly, because the records are not a sample of American workplaces; they are a sample of the workplaces OSHA chose to look at, for reasons that are themselves recorded.
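The once-a-century figure is simple division, and a back-of-envelope sketch makes the logic explicit. The workplace and inspection counts below are hypothetical round numbers chosen for illustration, not figures from this article or from official statistics; substitute current counts before quoting any result.

```python
def years_between_visits(workplaces: int, inspections_per_year: int) -> float:
    """Expected years between visits if inspections were spread uniformly at random."""
    if inspections_per_year <= 0:
        raise ValueError("inspections_per_year must be positive")
    return workplaces / inspections_per_year


# Hypothetical inputs (assumptions, not official counts).
ASSUMED_WORKPLACES = 8_000_000
ASSUMED_INSPECTIONS_PER_YEAR = 80_000

annual_share = ASSUMED_INSPECTIONS_PER_YEAR / ASSUMED_WORKPLACES
interval = years_between_visits(ASSUMED_WORKPLACES, ASSUMED_INSPECTIONS_PER_YEAR)
print(f"Share of workplaces inspected per year: {annual_share:.1%}")
print(f"Expected years between visits: {interval:.0f}")
```

Under these assumed inputs, one percent of workplaces are visited each year, which is the same statement as one visit per hundred years. Targeting changes who gets visited, not the size of that budget.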
Inspections arrive through a small number of channels, and the inspection record codes which one applied. A worker complaint or a referral (from another agency, a media report, or a compliance officer who saw a hazard from the street) triggers many inspections. Programmed inspections are the planned ones, driven by National and Local Emphasis Programs that select high-hazard industries—construction falls, grain handling, amputations in manufacturing—and by data-driven targeting that uses establishments' own reported injury rates to pick the worst performers. And a fatality or catastrophe—a worker death, or a severe injury reported to OSHA—frequently opens an inspection on its own. This last channel is what wires the datasets together causally rather than merely by key: a severe injury an employer reports can become the reason an inspection opens, which produces the citations, which the employer must then abate, which (in theory) shows up later as a lower injury rate on its 300A summary. The four files are snapshots of four stages of one process.
The four datasets
The assembled view rests on four distinct OSHA datasets, each with its own grain and its own identifier. In our database they are stored as osha_inspections, osha_violations, osha_severe_injuries, and osha_300a. Each maps cleanly onto one of the stages above. The inspections table holds one row per inspection OSHA conducted, identifying the workplace by employer name and address and recording when the inspection happened, why it was opened, the industry (NAICS), and its scope and outcome. The violations table holds one row per citation item—the specific standard violated, the violation type, the proposed and current penalty, the abatement deadline—and links to its parent inspection by the OSHA inspection number. The severe-injury reports table holds one row per amputation, in-patient hospitalization, or eye loss that an employer was required to report, with the employer, location, industry, and the body part, nature, and source of the injury. The 300A summaries table holds one row per establishment per year: the annual counts of injuries and illnesses, the cases with days away or restriction, the hours worked, and the average employment—the inputs to a workplace's injury rate.
The columns that matter for the join, drawn from across the four tables, are these:
-- osha_inspections (one row per inspection)
activity_nr -- OSHA inspection number; THE join key for citations
estab_name -- employer / establishment name (as recorded)
site_address, site_city, site_state, site_zip
naics_code -- industry of the inspected workplace
open_date -- date the inspection opened
insp_type, insp_scope -- why it opened (complaint, programmed, fatality...) and how broad
-- osha_violations (one row per citation item)
activity_nr -- links each citation back to its parent inspection
citation_id -- citation and item number within the inspection
standard -- the CFR standard cited (e.g. 19260501 -> 29 CFR 1926.501)
viol_type -- S=Serious, W=Willful, R=Repeat, O=Other-than-Serious
initial_penalty, current_penalty
abate_date -- deadline by which the hazard must be corrected
-- osha_severe_injuries (one row per reported event, since 2015)
employer, address, city, state, zip, naics_code
event_date -- when the amputation / hospitalization occurred
nature_title, part_of_body_title, source_title -- what happened, to what, by what
hospitalized, amputation -- which reporting trigger applied
-- osha_300a (one row per establishment-year)
establishment_name, street_address, city, state, zip, naics_code
year_filing_for
annual_average_employees, total_hours_worked
total_dafw_cases -- days-away-from-work cases
total_djtr_cases -- job-transfer / restriction cases (DART = DAFW + DJTR)
Two columns carry the weight. The activity_nr, the OSHA inspection (activity) number, is the clean, machine join: every citation in osha_violations belongs to exactly one inspection in osha_inspections, and the activity number ties them together unambiguously. That is the easy half. The hard half is the establishment: inspections, severe-injury reports, and 300A summaries all identify the employer by name, address, and (where available) an establishment identifier, but those identifiers are not fully standardized across the forms. The same plant may appear as “ACME Meat Packing, Inc.” on its 300A, “Acme Meat Packing” on an inspection, and “ACME MEAT PACKING CO” on a severe-injury report, with three slightly different renderings of the same street address. So the real engineering is not parsing the four sources, each of which is a well-behaved flat file, but joining citations to inspections on the activity number and, far harder, resolving the employer establishment across the other three.
The inspection and its citations
An inspection is the event; the citations are its output. When a compliance officer walks a site—whether prompted by a complaint, a programmed emphasis target, or a fatality—the inspection record captures the encounter: who was inspected, where, in what industry, when, why, and how broadly. Most of the analytic value of the inspection table on its own lives in those framing fields. The insp_type and scope distinguish a narrow, complaint-driven look at one hazard from a comprehensive wall-to-wall inspection, and the opening reason separates the inspections OSHA initiated as planned enforcement from the ones forced on it by an injury or death. An analyst who ignores these fields and treats every inspection as equivalent will badly misread the data, because a programmed inspection of a randomly selected high-hazard establishment and a fatality inspection of a workplace where someone has already died are not the same observation.
The citations carry the substance. Each row in osha_violations records one cited item—the specific standard the workplace violated, encoded as a CFR reference (for example the construction fall-protection standard, 29 CFR 1926.501, the most-cited standard in the country for over a decade), the violation type, and the penalty. The violation type is legally and analytically central: a serious violation means a substantial probability of death or serious harm that the employer knew or should have known about; a willful violation means intentional disregard or plain indifference; a repeat means the same employer was cited for a substantially similar condition before. Willful and repeat citations carry the highest penalties and the gravest legal weight. The proposed penalty on a citation is rarely what the employer pays—informal settlement, size and good-faith adjustments, and review-commission decisions routinely reduce it—which is why comparing the proposed penalty on the citation to the actual harm recorded in the injury data is one of the questions the assembled view exists to ask.
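The packed standard code is easier to read once it is expanded back into a CFR citation. The sketch below assumes the layout implied by the one example in the column notes (19260501 for 29 CFR 1926.501): four digits for the CFR part, then four zero-padded digits for the section. The function name `decode_standard` is hypothetical, anything after the first eight digits is ignored, and codes that do not fit the assumed layout raise an error rather than being guessed at.

```python
import re


def decode_standard(code: str) -> str:
    """Expand a packed standard code such as '19260501' into '29 CFR 1926.501'.

    Assumed layout: 4-digit CFR part + 4-digit zero-padded section.
    """
    m = re.match(r"^(\d{4})(\d{4})", code.strip())
    if not m:
        raise ValueError(f"unrecognized standard code: {code!r}")
    part, section = m.groups()
    return f"29 CFR {part}.{int(section)}"  # int() drops the zero padding


print(decode_standard("19260501"))
```

Check the assumed layout against the data dictionary for the file you download before relying on it for every row.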
Severe injuries and the 300A summary
The two injury datasets approach harm from opposite directions. The severe-injury reports are events. Since the reporting rule expanded on January 1, 2015 (29 CFR 1904.39), employers must report every work-related amputation, in-patient hospitalization, and loss of an eye to OSHA within 24 hours of learning of it; fatalities must be reported within 8 hours. Each report is one severe event, time-stamped and coded by the affected body part, the nature of the injury, and its source. This is the closest thing OSHA has to a near-real-time stream of serious harm, and because a severe-injury report can itself trigger an inspection, it is frequently the upstream cause of the inspection-and-citation records for the same employer. The reports are self-reported, however, which is both their strength (they do not depend on OSHA happening to inspect) and their central weakness (they depend on the employer choosing to report).
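The reporting windows reduce to a small lookup. The sketch below is illustrative only: the event-type labels and function names are hypothetical, and the clock is assumed to start when the employer learns of the event.

```python
from datetime import datetime, timedelta

# Hours allowed for the report, counted from when the employer learns of the
# event. The keys are hypothetical labels chosen for this sketch.
REPORT_WINDOW_HOURS = {
    "fatality": 8,
    "hospitalization": 24,
    "amputation": 24,
    "eye_loss": 24,
}


def report_deadline(event_type: str, learned_at: datetime) -> datetime:
    """Latest time the report may be made for this kind of event."""
    return learned_at + timedelta(hours=REPORT_WINDOW_HOURS[event_type])


def reported_on_time(event_type: str, learned_at: datetime,
                     reported_at: datetime) -> bool:
    """True when the report arrived at or before the deadline."""
    return reported_at <= report_deadline(event_type, learned_at)


learned = datetime(2024, 3, 4, 9, 30)
print(report_deadline("amputation", learned))
print(report_deadline("fatality", learned))
```

The public severe-injury file records when the event happened, not necessarily when the employer learned of it or reported it, so a check like this usually needs information beyond the download.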
The Form 300A summary is the denominator the events lack. Every covered establishment must keep a log of recordable injuries and illnesses (the OSHA 300 log) and post an annual summary—the 300A—of the year's counts, and larger and higher-hazard establishments must submit those summaries electronically to OSHA through the Injury Tracking Application. The 300A is not a list of events; it is a tally: total recordable cases, cases with days away from work, cases with job transfer or restriction, total hours worked, and average employment for the year. Those last two are what make the 300A indispensable—they let an analyst convert raw counts into a rate. The standard injury rates (the total recordable incident rate and the DART rate—days away, restricted, or transferred) are computed as cases multiplied by 200,000, the annual hours of a hundred full-time workers, divided by hours actually worked. A rate, unlike a count, is comparable across establishments of different size and across the same establishment year over year—which is precisely what a before-and-after test of an inspection requires.
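The rate formula is short enough to show in full. The sketch below applies it to one hypothetical establishment-year; the case counts and hours are invented for illustration, and `incidence_rate` is a name chosen for this example.

```python
def incidence_rate(cases: float, hours_worked: float) -> float:
    """Cases per 100 full-time workers: cases * 200,000 / hours worked."""
    if hours_worked <= 0:
        raise ValueError("hours_worked must be positive")
    return cases * 200_000 / hours_worked


# Hypothetical establishment-year (invented numbers).
dafw_cases = 4            # days-away-from-work cases
djtr_cases = 2            # job-transfer / restriction cases
total_recordable = 9      # all recordable cases
hours = 400_000           # total hours worked that year

dart = incidence_rate(dafw_cases + djtr_cases, hours)
trir = incidence_rate(total_recordable, hours)
print(f"DART: {dart:.2f}  TRIR: {trir:.2f}")
```

With 400,000 hours, the equivalent of 200 full-time workers, six DART cases work out to 3.0 per hundred workers and nine recordable cases to 4.5.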
The join: activity numbers and establishments
The assembled record is built on two joins of very different difficulty. The first is exact and mechanical: citations to inspections by the activity number. Every citation in osha_violations carries the activity_nr of the inspection that produced it, and every inspection in osha_inspections has that same number as its key. This join is lossless and deterministic; it is the part of the pipeline that does not keep an analyst up at night. It turns a bare inspection into a fully specified enforcement event—not just “OSHA inspected this plant in March” but “OSHA inspected this plant in March, cited it for six serious fall-protection violations and one willful lockout violation, and proposed a penalty of a particular size with abatement due by a particular date.”
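A minimal sketch of that first join, written in plain Python so it runs without the bulk files. The rows are invented, the field names follow the column list earlier in this article, and `attach_citations` is a hypothetical helper name.

```python
from collections import defaultdict

# Invented sample rows using the column names described above.
inspections = [
    {"activity_nr": "1001", "estab_name": "EXAMPLE PLANT", "open_date": "2022-03-14"},
    {"activity_nr": "1002", "estab_name": "OTHER SITE", "open_date": "2022-05-02"},
]
violations = [
    {"activity_nr": "1001", "citation_id": "01001", "viol_type": "S", "current_penalty": "5000"},
    {"activity_nr": "1001", "citation_id": "01002", "viol_type": "W", "current_penalty": "70000"},
    {"activity_nr": "9999", "citation_id": "01001", "viol_type": "O", "current_penalty": "0"},
]


def attach_citations(inspections, violations):
    """Group citation items under their parent inspection by activity_nr."""
    by_activity = defaultdict(list)
    for v in violations:
        by_activity[v["activity_nr"]].append(v)
    known = {i["activity_nr"] for i in inspections}
    # Citations whose parent inspection is missing from the extract.
    orphans = [v for v in violations if v["activity_nr"] not in known]
    events = []
    for insp in inspections:
        cites = by_activity.get(insp["activity_nr"], [])
        events.append({
            **insp,
            "n_citations": len(cites),
            "viol_types": sorted({c["viol_type"] for c in cites}),
            "penalty_total": sum(float(c["current_penalty"] or 0) for c in cites),
        })
    return events, orphans


events, orphans = attach_citations(inspections, violations)
for e in events:
    print(e["activity_nr"], e["n_citations"], e["viol_types"], e["penalty_total"])
print("orphan citations:", len(orphans))
```

Counting orphans is a cheap sanity check: in a complete extract the count should be zero, and a nonzero count usually means the two files were pulled on different dates or filtered differently.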
The second join is the hard one: aligning the same workplace across inspections, severe-injury reports, and 300A summaries by employer establishment. There is no shared activity number across these three. They identify the workplace by name, address, and where available an establishment identifier, and those identifiers are not consistent across forms or even within a form over time. A meatpacking plant can carry a different legal-entity suffix on its 300A than on its severe-injury report, a typo or an abbreviation in its street address, a parent-company name on one form and a doing-business-as name on another. Matching them is an entity-resolution problem, not a key lookup. A workable pipeline normalizes employer names (uppercasing, stripping punctuation and corporate suffixes like “Inc” and “LLC”), standardizes addresses to a consistent form, blocks candidate matches by ZIP code or city to keep the comparison tractable, and then scores name-and-address similarity to decide which rows describe the same physical establishment. The quality of this matching is the single largest determinant of whether the assembled record is trustworthy: match too loosely and you fuse distinct workplaces; match too strictly and you split one workplace's history across several phantom establishments and lose exactly the before-and-after linkage you were trying to build.
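The normalize, block, and score steps can be sketched with the standard library alone. Everything specific here is an assumption made for illustration: the record layout (`id`, `name`, `address`, `zip`), the 0.6/0.4 weighting, the 0.85 threshold, and the use of `difflib.SequenceMatcher` as the similarity measure. A production matcher would tune all of these against hand-checked pairs.

```python
import re
from collections import defaultdict
from difflib import SequenceMatcher

SUFFIXES = r"\b(INC|LLC|CO|CORP|LTD|THE)\b"


def norm(text):
    """Uppercase, drop punctuation and corporate suffixes, collapse spaces."""
    text = re.sub(r"[^A-Z0-9 ]", "", (text or "").upper())
    text = re.sub(SUFFIXES, "", text)
    return re.sub(r"\s+", " ", text).strip()


def similarity(a, b):
    return SequenceMatcher(None, a, b).ratio()


def match_establishments(left, right, threshold=0.85):
    """Block on 5-digit ZIP, then score name and street similarity per block."""
    blocks = defaultdict(list)
    for r in right:
        blocks[r["zip"][:5]].append(r)
    matches = []
    for l in left:
        for r in blocks.get(l["zip"][:5], []):
            score = (0.6 * similarity(norm(l["name"]), norm(r["name"]))
                     + 0.4 * similarity(norm(l["address"]), norm(r["address"])))
            if score >= threshold:
                matches.append((l["id"], r["id"], round(score, 3)))
    return matches


# Invented records: one true match, one same-name site in another ZIP,
# and one unrelated employer in the same ZIP.
inspected = [{"id": "i1", "name": "ACME Meat Packing, Inc.",
              "address": "100 Main St.", "zip": "68801"}]
summaries = [
    {"id": "r1", "name": "ACME MEAT PACKING CO", "address": "100 MAIN ST", "zip": "68801-1234"},
    {"id": "r2", "name": "Acme Meat Packing", "address": "100 Main St", "zip": "68802"},
    {"id": "r3", "name": "Zenith Tool LLC", "address": "9 Oak Ave", "zip": "68801"},
]
matches = match_establishments(inspected, summaries)
print(matches)
```

Blocking is what keeps this tractable, but it is also a source of missed matches: the second record is never compared because its ZIP differs, which is the right call for a different site and the wrong one for a mistyped ZIP.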
What the assembled data answers
Joined, the four datasets answer the questions that sit behind worker-safety policy and that no single file can address. The headline question is the deterrence question: do inspected and cited establishments see fewer injuries afterward? Because the 300A supplies a year-by-year injury rate and the inspection supplies a dated enforcement event, an analyst can line them up for the same establishment and compare the rate in the years before the inspection to the rate in the years after—the empirical core of every debate about whether OSHA enforcement actually improves safety or merely punishes after the fact. Done carefully, with a comparison group of similar un-inspected establishments to absorb the general downward trend in injury rates and the regression-to-the-mean that follows any spike, this is the most policy-relevant use of the assembled data.
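The comparison-group logic is a difference-in-differences: the change at inspected establishments minus the change at comparable uninspected ones. The sketch below uses invented DART rates and a hypothetical function name; it shows the arithmetic only and says nothing about how the comparison group should be chosen.

```python
from statistics import mean


def diff_in_diff(treated_before, treated_after, control_before, control_after):
    """Change in the inspected group minus change in the comparison group."""
    treated_change = mean(treated_after) - mean(treated_before)
    control_change = mean(control_after) - mean(control_before)
    return treated_change - control_change


# Invented DART rates, one value per establishment-year.
inspected_before = [6.0, 7.0]
inspected_after = [4.0, 4.0]
comparison_before = [5.0, 5.0]
comparison_after = [4.0, 4.0]

effect = diff_in_diff(inspected_before, inspected_after,
                      comparison_before, comparison_after)
print(f"Raw change at inspected sites: {mean(inspected_after) - mean(inspected_before):+.2f}")
print(f"Difference-in-differences:     {effect:+.2f}")
```

In this made-up case the raw drop of 2.5 shrinks to 1.5 once the general decline is subtracted. Subtracting a trend does not by itself remove regression to the mean; that requires comparison establishments with a similar pre-inspection spike.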
The data also answers which industries and employers carry the worst records—ranking NAICS sectors and named establishments by severe-injury counts, by citation severity, and by 300A injury rates, and surfacing the multi-establishment firms whose repeat citations reveal a systemic failure rather than a single bad site. It lets an analyst weigh proposed penalties against the harm done, setting the dollar figure on a citation beside the amputations and hospitalizations reported at the same workplace—the comparison that animates the long-running criticism that OSHA penalties are too small to deter. And it measures how much of the hazardous economy OSHA's thin coverage reaches: by joining the severe-injury stream (which does not depend on OSHA inspecting) to the inspection record (which does), one can estimate how many serious injuries occur at workplaces OSHA never visits, and in which industries the gap between harm and enforcement is widest. That last figure is the empirical statement of the resource constraint the agency operates under.
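The coverage-gap measurement can be sketched as a share per industry sector. The sketch assumes the establishment matching has already been done and that each severe-injury row carries a resolved `estab_id`, a hypothetical field that does not exist in the raw file. The rows are invented.

```python
from collections import Counter


def uninspected_share_by_sector(injuries, inspected_ids):
    """Share of severe-injury reports at never-inspected establishments,
    grouped by 2-digit NAICS sector."""
    total, missed = Counter(), Counter()
    for inj in injuries:
        sector = inj["naics_code"][:2]
        total[sector] += 1
        if inj["estab_id"] not in inspected_ids:
            missed[sector] += 1
    return {sector: missed[sector] / total[sector] for sector in total}


# Invented rows; estab_id is the output of a prior matching step.
injuries = [
    {"estab_id": "e1", "naics_code": "311611"},
    {"estab_id": "e2", "naics_code": "311612"},
    {"estab_id": "e3", "naics_code": "238160"},
    {"estab_id": "e4", "naics_code": "236220"},
]
inspected_ids = {"e1", "e3", "e4"}

gap = uninspected_share_by_sector(injuries, inspected_ids)
print(gap)
```

Because the numerator depends on matching, every missed match inflates the apparent gap: an establishment that was inspected under a differently spelled name counts as never visited.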
Python workflow: assembling one establishment's record
The script below assembles a single employer's record from all four sources. It reads the OSHA enforcement bulk files (inspections and violations) from the Department of Labor's enforcement data catalog, the severe-injury report file and the Injury Tracking Application 300A file from osha.gov, finds the establishment's inspections by name and state, attaches its citations by joining on the activity number, matches its severe-injury reports and 300A summaries by normalized employer name, computes a DART injury rate from the 300A hours and case counts, and finally compares the rate before and after a chosen inspection year. No API key is required; all four are public, key-free Department of Labor downloads. The employer-name normalization is deliberately simple—a placeholder for the real entity resolution the matching problem demands—and the exact download URLs change with each refresh, so confirm them against the current enforcedata.dol.gov catalog and the OSHA injury-data pages before running.
import io
import zipfile

import pandas as pd
import requests

# Assemble an establishment's full federal safety record from four OSHA sources.
# All are public and key-free through the Department of Labor:
#   1. Inspections           -- enforcedata.dol.gov bulk file (one row per inspection)
#   2. Violations            -- enforcedata.dol.gov bulk file (one row per citation item)
#   3. Severe injuries (SIR) -- osha.gov download (one row per reported event)
#   4. Form 300A summaries   -- osha.gov ITA download (one row per establishment-year)
# Join keys: violations link to inspections by activity_nr (the OSHA inspection
# number); inspections, SIR, and 300A identify the employer by name + address.
DOL = "https://enforcedata.dol.gov/data_catalog"  # OSHA enforcement bulk files
SIR_URL = "https://www.osha.gov/sites/default/files/January2015toOctober2025.zip"
ITA_300A = "https://www.osha.gov/sites/default/files/ITA_300A_Summary_Data_2023_through_12-31-2024.zip"

def _read_zip_csv(url):
    # Download a CSV (or a zip wrapping one) into an all-string DataFrame.
    r = requests.get(url, timeout=600)
    r.raise_for_status()
    raw = r.content
    if raw[:2] == b"PK":  # PK = zip magic bytes; unwrap the first CSV inside
        zf = zipfile.ZipFile(io.BytesIO(raw))
        raw = zf.read(next(n for n in zf.namelist() if n.lower().endswith(".csv")))
    df = pd.read_csv(io.BytesIO(raw), dtype=str, low_memory=False)
    df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")
    return df

def normalize_name(s):
    # The hard part: employer identifiers are not standardized across the four
    # forms. Cheap, deterministic normalization beats nothing -- production work
    # wants real entity resolution on name + street + ZIP.
    return (s.fillna("").str.upper()
             .str.replace(r"[^A-Z0-9 ]", "", regex=True)
             .str.replace(r"\b(INC|LLC|CO|CORP|LTD|THE)\b", "", regex=True)
             .str.replace(r"\s+", " ", regex=True)  # collapse gaps left by removed words
             .str.strip())

def establishment_record(insp, viol, sir, ita, name_substr, state):
    # 1. Inspections for this employer (case-insensitive substring match + state)
    insp = insp[insp["estab_name"].fillna("").str.contains(
        name_substr, case=False, regex=False)]
    insp = insp[insp["site_state"] == state].copy()
    acts = set(insp["activity_nr"].dropna())
    print(f"Inspections: {len(insp):,} (activity numbers: {len(acts):,})")

    # 2. Citations -- joined to their parent inspection by activity_nr
    cites = viol[viol["activity_nr"].isin(acts)].copy()
    cites["penalty"] = pd.to_numeric(cites["current_penalty"], errors="coerce")
    # This sums current_penalty (after any adjustment), not the initial proposal.
    print(f"Citations: {len(cites):,} current penalty total: "
          f"${cites['penalty'].sum():,.0f}")

    # 3. Severe-injury reports for the same employer (matched by name + state)
    sir_key = normalize_name(sir["employer"])
    target = normalize_name(pd.Series([name_substr]))[0]
    inj = sir[sir_key.str.contains(target, na=False, regex=False)
              & (sir["state"] == state)]
    print(f"Severe-injury reports (amputations / hospitalizations): {len(inj):,}")

    # 4. Form 300A establishment summaries -> the injury *rate* (DART)
    ita_key = normalize_name(ita["establishment_name"])
    estab = ita[ita_key.str.contains(target, na=False, regex=False)
                & (ita["state"] == state)].copy()
    for c in ["total_dafw_cases", "total_djtr_cases", "annual_average_employees",
              "total_hours_worked"]:
        if c not in estab.columns:
            estab[c] = float("nan")  # column absent from this file vintage
        estab[c] = pd.to_numeric(estab[c], errors="coerce")
    # DART rate = (days-away + restricted/transfer cases) * 200,000 / hours worked.
    # Zero or negative hours become NaN so the rate is missing rather than infinite.
    hours = estab["total_hours_worked"].where(estab["total_hours_worked"] > 0)
    estab["dart"] = ((estab["total_dafw_cases"] + estab["total_djtr_cases"])
                     * 200000 / hours)
    print(estab[["year_filing_for", "dart"]].dropna().to_string(index=False))
    return insp, cites, inj, estab

def before_after(estab, inspection_year):
    # Does the establishment's injury rate fall after an inspection? Compare the
    # mean 300A DART rate in the years before vs. after the inspection year.
    # The inspection year itself is left out of both sides, because its annual
    # rate mixes months before and after the visit.
    estab = estab.dropna(subset=["dart"]).copy()
    estab["yr"] = pd.to_numeric(estab["year_filing_for"], errors="coerce")
    before = estab[estab["yr"] < inspection_year]["dart"]
    after = estab[estab["yr"] > inspection_year]["dart"]
    if before.empty or after.empty:
        print("Not enough 300A years on both sides of the inspection.")
        return
    print(f"DART before: {before.mean():.2f} after: {after.mean():.2f} "
          f"change: {(after.mean() - before.mean()):+.2f}")

# insp = _read_zip_csv(".../osha_inspection.csv.zip")
# viol = _read_zip_csv(".../osha_violation.csv.zip")
# sir = _read_zip_csv(SIR_URL)
# ita = _read_zip_csv(ITA_300A)
# rec = establishment_record(insp, viol, sir, ita, "ACME MEAT", "NE")
# before_after(rec[3], inspection_year=2022)
Two refinements separate this illustration from a defensible analysis. First, the before-and-after comparison as written measures only the raw change in one establishment's DART rate around its inspection—which conflates the effect of the inspection with the secular decline in injury rates, with regression to the mean (establishments are often inspected precisely after a bad year, so the rate would have fallen anyway), and with any change in the establishment's size or product mix. A credible deterrence estimate needs a matched comparison group of similar establishments that were not inspected, and the 300A panel supplies the multi-year rates to build it. Second, the entity-resolution step is doing far more work than its three lines of normalization suggest. Production matching should score name and full-address similarity, block on ZIP to keep the comparison feasible, and carry a match-confidence flag through to the analysis so that low-confidence joins can be quarantined—because every wrong match either fabricates an injury history for an establishment that never had one or erases the history of one that did.
Limitations and analytical caveats
The assembled worker-safety record is the most complete public picture of an establishment's federal safety history available, but it inherits the structural limitations of all four of its sources, and one new limitation created by the act of joining them.
Inspection coverage is thin and non-random. OSHA reaches only a small fraction of workplaces, and the ones it reaches are selected—by complaint, by emphasis-program targeting, by data-driven rate screening, or by a fatality or catastrophe. The inspection record is therefore not a sample of American workplaces but a sample of the workplaces OSHA had reason to examine. Any statement about “cited establishments” is a statement about a heavily selected population, and comparisons that ignore why an inspection opened—treating a programmed inspection and a fatality inspection as the same observation—will draw conclusions the selection process, not the workplaces, produced.
The injury data is self-reported, and underreporting is documented. Both injury datasets depend on employers reporting honestly. Severe-injury reports require the employer to phone in an amputation or hospitalization, and studies of the 2015 rule found substantial non-reporting; the 300A summary is the employer's own tally of recordable cases, and there is a long, well-documented history of establishments under-recording injuries on their logs—sometimes precisely to keep their reported rate below the threshold that would trigger data-driven inspection targeting. A workplace that suppresses its 300A rate looks safer in the data and is less likely to be inspected, a feedback loop that makes the absence of recorded injuries a weak guarantee of actual safety.
The State Plan gap fragments coverage. Roughly half the states run their own OSHA programs under Section 18 of the OSH Act, and the federal enforcement bulk files do not uniformly contain the inspections and citations those State Plan programs conduct—a workplace in California or Washington may have an extensive state-level enforcement history that is invisible in the federal inspection and violation tables even as its severe-injury reports and 300A summaries appear in the national files. An establishment with no federal inspections is not necessarily an establishment that has never been inspected; it may simply sit in a State Plan jurisdiction. Cross-state comparisons of enforcement built on the federal files alone will systematically understate activity in State Plan states.
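One practical guard is to keep "no federal inspection on record" separate from "never inspected" whenever the establishment sits in a State Plan state. The sketch below is deliberately incomplete: the state set holds only the two states named above, and both the function name and the status labels are hypothetical. Fill the set from OSHA's current State Plan list before using it.

```python
# Illustrative subset only: the two State Plan states named in the text.
# Replace with the full, current list before real use.
STATE_PLAN_STATES = {"CA", "WA"}


def federal_inspection_status(state: str, n_federal_inspections: int) -> str:
    """Label an establishment so that missing data is not read as 'never inspected'."""
    if n_federal_inspections > 0:
        return "inspected (federal record)"
    if state in STATE_PLAN_STATES:
        return "unknown (State Plan jurisdiction)"
    return "no federal inspection on record"


for state, n in [("CA", 0), ("TX", 0), ("WA", 3)]:
    print(state, n, "->", federal_inspection_status(state, n))
```

Carrying the "unknown" label into the analysis lets State Plan establishments be excluded from, or handled separately in, any comparison of inspected and uninspected workplaces.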
The join itself introduces error. The clean activity-number join between citations and inspections is reliable, but the establishment-level matching across inspections, injuries, and 300A summaries is fuzzy by necessity, and every match decision is a possible error. False matches fuse the histories of distinct workplaces—attributing one plant's amputations to another—while missed matches split one establishment's record into fragments and silently break the before-and-after linkage the analysis depends on. The match rate and the false-match rate are not nuisances to be hidden; they are first-class results that belong in any honest report, because the substantive findings are only as trustworthy as the entity resolution beneath them.
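Reporting those two rates requires a hand-checked audit sample to compare against. The sketch below assumes such a sample exists as a set of true (left, right) pairs; the pairs shown are invented and `match_quality` is a hypothetical helper name.

```python
def match_quality(predicted: set, truth: set) -> dict:
    """Compare the matcher's pairs with hand-verified pairs.

    match_rate       = share of true pairs the matcher found
    false_match_rate = share of the matcher's pairs that are wrong
    """
    correct = predicted & truth
    return {
        "match_rate": len(correct) / len(truth) if truth else 0.0,
        "false_match_rate": (len(predicted) - len(correct)) / len(predicted)
                            if predicted else 0.0,
    }


# Invented audit sample: (inspection-side id, 300A-side id).
predicted = {("a", "1"), ("b", "2"), ("c", "9")}
truth = {("a", "1"), ("b", "2"), ("d", "4"), ("e", "5")}

quality = match_quality(predicted, truth)
print(quality)
```

Here the matcher found two of four true pairs and one of its three links is wrong. Numbers like these belong next to any before-and-after estimate built on the matched data.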
Held with these caveats in mind, the joined record across osha_inspections, osha_violations, osha_severe_injuries, and osha_300a is a uniquely valuable instrument: it reconstructs, for a single workplace, the full arc of federal safety oversight—the inspection that opened, the standards it cited, the severe injuries the employer reported, and the injury rates the establishment posted before and after—and so lets an analyst test, rather than assume, whether the country's thin and selective enforcement actually leaves workers safer than it found them.