FDA FAERS: The Adverse Drug Event Database Behind Post-Market Drug Safety

14 min read · AI Analytics
Federal Data · FDA · Drug Safety · Pharmacovigilance

The FDA Adverse Event Reporting System (FAERS) is the primary federal mechanism for tracking drug and biologic safety after a product reaches the market. When a patient dies while taking a medication, when a physician observes an unexpected reaction, or when a manufacturer learns of a serious adverse outcome, those reports flow into FAERS. The database now holds more than 30 million reports. Quarterly bulk files in the current FAERS format begin with 2012Q4, and legacy AERS extracts reach back to 2004. It is the foundation of post-market pharmacovigilance in the United States, and it is fully public.

This article covers what FAERS contains and how its seven relational files link together, the MedDRA medical terminology hierarchy used to code adverse reactions, the reporting hierarchy and its built-in biases, the data access paths (dashboard, bulk download, OpenFDA API), the statistical methods used to detect safety signals before formal recalls, major historical signal cases including rosiglitazone, rofecoxib, and SSRIs in adolescents, the PRIMARYID versus CASEID deduplication problem that corrupts naive analyses, a Python workflow for downloading and running a proportional reporting ratio calculation, and the fundamental limitations of voluntary reporting that every analyst must understand.

What FAERS collects and why

When the FDA approves a drug, clinical trials have typically enrolled thousands of patients over years. That sample is large enough to detect common adverse effects but far too small to detect reactions that occur in one in ten thousand patients, reactions that only emerge after five years of chronic use, or interactions with other drugs that were excluded from the trial population. Post-market surveillance exists to fill that gap. FAERS is the main instrument.

Reports enter the system through three channels. Drug manufacturers are legally required under 21 CFR Part 314 to submit reports of serious adverse events within 15 calendar days of becoming aware of them—the “15-day alert report” requirement. Non-serious events and periodic safety updates follow on quarterly or annual schedules. Healthcare providers—physicians, nurses, pharmacists—submit voluntary reports through MedWatch, the FDA's safety reporting program. Consumers and patients can also submit MedWatch reports directly. The practical result is that manufacturer reports dominate FAERS numerically, accounting for roughly 90 percent of submissions, while voluntary healthcare provider and consumer reports are sparser but often more clinically detailed.

The seven-file relational schema

FAERS bulk data is distributed as quarterly ZIP files, each containing seven ASCII text files delimited by the dollar sign ($). Every table links back to a central case identifier. Understanding the schema is mandatory before doing any analysis: naive joins produce silently wrong results.

DEMO is the demographics table and the spine of the schema. One row per report. It holds the patient's age, sex, weight, country of origin, report date, event date, and the key identifiers: primaryid and caseid. The primaryid is a globally unique report identifier. The caseid is the underlying case. When a case is followed up—for example, when a manufacturer files an updated report after a patient dies who was initially reported as hospitalized—a new primaryid is assigned to the same caseid. All seven tables use primaryid as their foreign key.

DRUG contains one row per drug per report. A single report frequently lists multiple drugs. Each row includes the drug name as reported (often brand name, often inconsistently formatted), the drug's role in the case (primary suspect, secondary suspect, concomitant, or interacting), dose, route of administration, and dates of therapy start and end. Drug name normalization is one of the hardest problems in FAERS analysis—the same molecule may appear as a brand name, a generic name, an abbreviation, or a misspelling across thousands of reports.

REAC lists the adverse reactions coded to MedDRA Preferred Terms, one row per reaction per report. A report with three reactions produces three REAC rows. This is where the medical signal lives.

OUTC records the outcome of the adverse event: death, life-threatening event, hospitalization, disability, congenital anomaly, required intervention to prevent permanent impairment, or other. A single case can have multiple outcomes. Death cases are a particular focus of regulatory attention. In the current FAERS layout they are identified through the death outcome code (DE) in OUTC; the death_dt date field in DEMO belongs to the legacy AERS layout.

RPSR identifies the report source: healthcare professional, consumer, literature, foreign regulatory authority, or company study. This field is critical for analysis because the reporting population differs substantially across source types. Literature reports, for instance, describe events that were published in medical journals and then submitted to FAERS as secondary reports—they may involve older events and systematically different populations.

THER contains therapy date information: start date, end date, and duration of drug exposure. Because many reports are missing therapy dates or have implausible values, time-to-onset analyses built on THER data require extensive validation. The field is sparsely populated for consumer-submitted reports.

INDI records the indication for which the drug was used, again coded to MedDRA Preferred Terms. This file is often overlooked but is analytically important: a drug prescribed for indication A in population X may have a very different adverse event profile than the same drug prescribed for indication B in population Y, and INDI is the only field that surfaces this distinction.
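The warning about naive joins can be made concrete with a small sketch. It uses invented toy rows rather than real FAERS data and plain Python rather than pandas, purely to show why joining two one-to-many tables on primaryid multiplies rows: a report with two drugs and three reactions becomes six joined rows, so counting joined rows overstates the number of reports.

```python
from collections import defaultdict

# Toy rows standing in for DRUG and REAC (invented values; the real files
# carry many more columns).
drug_rows = [
    {"primaryid": 1001, "drugname": "DRUG A"},
    {"primaryid": 1001, "drugname": "DRUG B"},
    {"primaryid": 1002, "drugname": "DRUG A"},
]
reac_rows = [
    {"primaryid": 1001, "pt": "Nausea"},
    {"primaryid": 1001, "pt": "Headache"},
    {"primaryid": 1001, "pt": "Rash"},
    {"primaryid": 1002, "pt": "Nausea"},
]

# Index reactions by report so the join is a dictionary lookup.
reacs_by_report = defaultdict(list)
for row in reac_rows:
    reacs_by_report[row["primaryid"]].append(row["pt"])

# Joining DRUG to REAC on primaryid yields one row per (drug, reaction) pair.
joined = [
    (d["primaryid"], d["drugname"], pt)
    for d in drug_rows
    for pt in reacs_by_report[d["primaryid"]]
]

joined_row_count = len(joined)                          # 2*3 + 1*1 = 7 rows
distinct_reports = len({pid for pid, _, _ in joined})   # only 2 reports

print(joined_row_count, distinct_reports)
```

The safe habit is to count distinct primaryid values after any join, never joined rows.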

MedDRA: the medical terminology hierarchy

All adverse reactions in FAERS are coded using the Medical Dictionary for Regulatory Activities, universally abbreviated as MedDRA. MedDRA is an international standard maintained by the International Council for Harmonisation of Technical Requirements for Pharmaceuticals for Human Use. It organizes medical concepts into a five-level hierarchy.

At the most granular level are Lowest Level Terms (LLTs), the specific phrases used by reporters, which map upward to a Preferred Term (PT). The PT is the primary unit of analysis in FAERS. Each PT maps to a High Level Term (HLT), which groups into a High Level Group Term (HLGT), which rolls up to a System Organ Class (SOC). SOCs represent broad physiological categories such as cardiac disorders, hepatobiliary disorders, and nervous system disorders. There are 27 in total.

MedDRA is proprietary. Analysts must subscribe to access the full hierarchy mapping files, though academic and government researchers can obtain licenses at reduced cost. For FAERS analysis, the critical mapping is from PT codes (eight-digit integers) to PT names, HLT names, and SOC names. The REAC table stores PT names as text strings, so basic reaction counting is possible without a MedDRA license, but hierarchical rollup to SOC level requires the mapping file.

One MedDRA concept that frequently causes confusion is the Standardised MedDRA Query (SMQ). SMQs are curated lists of PTs that together represent a clinically meaningful syndrome. For example, the “Hepatotoxicity” SMQ bundles dozens of individual liver injury PTs that a naive search for the single term “liver failure” would miss. Signal detection on SMQs rather than individual PTs substantially increases sensitivity for complex adverse event syndromes.
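A minimal sketch of the difference between searching one PT and searching a grouped term list follows. The term set here is a hypothetical, hand-picked handful of liver-related terms for illustration only; it is not the licensed SMQ content, and the report rows are invented.

```python
# Hypothetical, hand-picked liver-related terms; NOT the licensed SMQ contents.
LIVER_TERMS = {
    "HEPATIC FAILURE",
    "DRUG-INDUCED LIVER INJURY",
    "JAUNDICE",
    "HEPATITIS",
}

# Invented (primaryid, pt) pairs standing in for REAC rows.
reac_rows = [
    (1, "Hepatic failure"),
    (2, "Jaundice"),
    (2, "Hepatitis"),
    (3, "Drug-induced liver injury"),
    (4, "Nausea"),
]

def reports_matching(rows, terms):
    """Return the set of report ids that mention at least one of the terms."""
    wanted = {t.strip().upper() for t in terms}
    return {pid for pid, pt in rows if pt.strip().upper() in wanted}

single_term_hits = reports_matching(reac_rows, {"Hepatic failure"})
grouped_hits = reports_matching(reac_rows, LIVER_TERMS)

print(len(single_term_hits), len(grouped_hits))
```

In this toy data the single-term search finds one report and the grouped search finds three, which is the sensitivity gain the paragraph above describes.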

The reporting hierarchy and its built-in distortions

The mandatory 15-day reporting requirement for serious events means that manufacturer reports are systematically biased toward serious outcomes in newly approved drugs. When a drug is first approved, the manufacturer's pharmacovigilance team actively monitors all incoming adverse event reports. As the drug ages and post-market attention normalizes, the volume of manufacturer-submitted spontaneous reports may decline even if the true adverse event rate holds steady.

The FDA's own estimates suggest that fewer than one percent of serious adverse events experienced in clinical practice are ever reported to FAERS. The actual underreporting fraction varies enormously by event type: a patient death is far more likely to be reported than a mild adverse reaction, and a novel or unexpected reaction is more likely to be reported than a reaction that is already listed in the drug's label. A separate, time-dependent distortion is the Weber effect: reporting volume for a new drug peaks in the first two years after approval and then declines, independent of the actual adverse event rate. Any analysis that treats FAERS counts as absolute incidence numbers is methodologically invalid.

There is also no denominator. FAERS reports adverse events but contains no data on total prescriptions dispensed, patient exposure time, or the size of the treated population. Without a denominator, an absolute count of 500 reports of drug-induced liver injury is uninterpretable—it could reflect a genuinely dangerous drug prescribed to ten million patients, or a very safe drug that generated outsized media attention and consumer reporting. Denominator data exists elsewhere (IMS/IQVIA prescription counts, CMS Part D claims, state prescription drug monitoring programs) but must be linked externally.
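The arithmetic behind the denominator problem is short enough to sketch. The 500-report count and the ten-million-patient exposure come from the paragraph above; the 50,000-patient exposure is a hypothetical second scenario added for contrast. The result is a reporting rate, not an incidence rate, because underreporting is still unaccounted for.

```python
def reports_per_100k(report_count, patients_exposed):
    """Reports per 100,000 exposed patients (a reporting rate, not incidence)."""
    return report_count / patients_exposed * 100_000

# Same report count, two exposure scenarios.
widely_used = reports_per_100k(500, 10_000_000)   # about 5 per 100,000
niche_drug = reports_per_100k(500, 50_000)        # about 1,000 per 100,000

print(round(widely_used, 1), round(niche_drug, 1))
```

The identical count of 500 differs by a factor of 200 once exposure is known, which is why externally linked denominator data changes the interpretation so sharply.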

Data access: dashboard, bulk files, and OpenFDA

The FDA provides three distinct access paths to FAERS data, each suited to different analytical needs.

The FAERS Public Dashboard at fda.gov/drugs/questions-and-answers-fdas-adverse-event-reporting-system-faers is a browser-based query interface. Analysts can search by drug name, reaction term, or report source and retrieve counts and summary tables. The dashboard is useful for quick lookups and for communicating findings to non-technical audiences, but it has no export capability for raw-level data and applies its own internal deduplication logic that is not fully documented.

The FAERS quarterly bulk files are ZIP archives available for download from the FDA's FIS export server. Files in the current format cover every quarter from 2012Q4 through the most recent completed quarter. Earlier data (from the legacy AERS system) is available in a separate archive. Each quarterly ZIP contains the seven ASCII tables described above. The files are large: a full 2023 quarterly ZIP runs roughly 400–600 MB compressed, with the DRUG table being the largest. Analysts working with multiple years of data should expect to process tens of gigabytes.

The OpenFDA API at api.fda.gov/drug/event.json provides programmatic access to a subset of FAERS data through a RESTful interface. The API supports field-level filtering, full-text search, and count queries against a pre-processed version of the FAERS dataset. The count endpoint is particularly useful for rapid frequency analysis. For example, api.fda.gov/drug/event.json?search=patient.drug.medicinalproduct:WARFARIN&count=patient.reaction.reactionmeddrapt.exact returns a ranked list of all reaction terms reported for warfarin. The API applies its own deduplication and has an undocumented data lag relative to the bulk files, typically one to two quarters behind. Rate limits apply to unauthenticated requests (240 per minute); API keys are available for free.
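A minimal sketch of how the count query might be assembled and its response unpacked is shown here. The helper names build_count_url and parse_counts are hypothetical, the sample payload is invented, and no network request is made; the sketch only assumes that count responses carry a results list of term and count pairs.

```python
from urllib.parse import urlencode

OPENFDA_EVENT_URL = "https://api.fda.gov/drug/event.json"

def build_count_url(drug, count_field="patient.reaction.reactionmeddrapt.exact",
                    limit=10, api_key=None):
    """Build a count query for one drug name (hypothetical helper)."""
    params = {
        "search": "patient.drug.medicinalproduct:" + drug,
        "count": count_field,
        "limit": limit,
    }
    if api_key:
        params["api_key"] = api_key
    # urlencode percent-encodes the colon in the search expression.
    return OPENFDA_EVENT_URL + "?" + urlencode(params)

def parse_counts(payload):
    """Turn a count response into (term, count) tuples."""
    return [(row["term"], row["count"]) for row in payload.get("results", [])]

url = build_count_url("WARFARIN")

# Invented payload in the assumed response shape; not real API output.
sample_payload = {"results": [{"term": "EXAMPLE TERM A", "count": 12},
                              {"term": "EXAMPLE TERM B", "count": 7}]}
counts = parse_counts(sample_payload)

print(url)
print(counts)
```

Passing the resulting URL to requests.get and the decoded JSON to parse_counts would complete the round trip. Drug names containing spaces would need quoting inside the search expression, which this sketch does not handle.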

Disproportionality analysis: how safety signals are detected

Because FAERS cannot support absolute incidence estimates, drug safety analysts use a class of methods called disproportionality analysis to ask a different question: is the proportion of reports involving drug X and reaction Y higher than would be expected given the overall reporting patterns in the database? The two most widely used measures are the Proportional Reporting Ratio (PRR) and the Reporting Odds Ratio (ROR).

The Proportional Reporting Ratio compares the proportion of reports for drug X that mention reaction Y against the proportion of reports for all other drugs that mention reaction Y. In a standard two-by-two contingency table, cell a is reports of drug X with reaction Y, cell b is reports of drug X without reaction Y, cell c is reports of all other drugs with reaction Y, and cell d is reports of all other drugs without reaction Y. The PRR equals (a / (a+b)) divided by (c / (c+d)). A PRR of 2.0 means the reaction is twice as proportionally common in reports for drug X as in reports for everything else in the database. The European Medicines Agency threshold for a potential signal is PRR ≥ 2 with a chi-squared statistic ≥ 4 and at least 3 cases. This is a screening criterion, not a causal finding.
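As a worked example of the formula and the three-part screening rule, the sketch below uses hypothetical cell counts. The chi-squared statistic is the plain Pearson form without continuity correction, which is an assumption; some implementations apply a Yates correction instead.

```python
def prr_with_chi2(a, b, c, d):
    """Return (PRR, Pearson chi-squared) for a 2x2 table of report counts."""
    n = a + b + c + d
    prr = (a / (a + b)) / (c / (c + d))
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    return prr, chi2

def meets_screening_threshold(a, prr, chi2):
    """PRR >= 2, chi-squared >= 4, and at least 3 cases."""
    return prr >= 2 and chi2 >= 4 and a >= 3

# Hypothetical counts: 1,000 reports for drug X, 99,000 for all other drugs.
a, b, c, d = 20, 980, 100, 98_900
prr, chi2 = prr_with_chi2(a, b, c, d)
is_signal = meets_screening_threshold(a, prr, chi2)

print(round(prr, 2), is_signal)
```

Here 2 percent of drug X reports mention the reaction against roughly 0.1 percent for everything else, giving a PRR of about 19.8.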

The Reporting Odds Ratio is calculated as (a × d) / (b × c), the odds of reaction Y in drug X reports relative to the odds of reaction Y in all other reports. The ROR is preferred in settings with sparse data because it behaves better statistically at low cell counts. The FDA's own FAERS data mining relies on a different statistic, the Empirical Bayes Geometric Mean (EBGM), which is produced by the Multi-item Gamma Poisson Shrinker (MGPS) algorithm. It uses a Bayesian shrinkage approach that borrows strength across similar drug-reaction pairs to stabilize estimates for sparse cells. The EBGM score is published by the FDA in its FAERS quarterly signal detection reports but the underlying algorithm weights are not fully disclosed.
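A short sketch of the ROR with a 95 percent confidence interval follows, using the same kind of hypothetical cell counts. The interval uses the standard log-scale approximation, and treating a lower bound above 1 as the screening criterion is a common convention assumed here rather than something the text specifies. All four cells must be non-zero for the calculation to work.

```python
import math

def ror_with_ci(a, b, c, d, z=1.96):
    """Return (ROR, lower, upper) using the log-scale normal approximation."""
    ror = (a * d) / (b * c)
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(ror) - z * se_log)
    upper = math.exp(math.log(ror) + z * se_log)
    return ror, lower, upper

# Hypothetical counts for illustration.
ror, lower, upper = ror_with_ci(20, 980, 100, 98_900)
lower_bound_excludes_one = lower > 1

print(round(ror, 2), round(lower, 2), round(upper, 2))
```

With counts this large the ROR and PRR are close; they diverge mainly when the reaction is common among the target drug's reports.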

Disproportionality signals are hypotheses, not conclusions. A PRR above threshold for a drug-reaction pair triggers a formal pharmacovigilance review, which may involve requesting additional data from the manufacturer, querying medical literature, reviewing clinical trial data, and in some cases commissioning an epidemiological study. The signal-to-action timeline can be years.

Major signal cases

Three historical cases illustrate how FAERS-based disproportionality analysis intersects with regulatory action—and how long that intersection can take.

Rosiglitazone (Avandia) and cardiovascular events. Rosiglitazone was approved in 1999 for type 2 diabetes. By 2007, FAERS contained thousands of reports of myocardial infarction and heart failure in rosiglitazone users. A meta-analysis published in the New England Journal of Medicine in May 2007 found a 43 percent increase in myocardial infarction risk. Retrospective analysis of FAERS data showed that the disproportionality signal for cardiac failure had been statistically detectable years earlier. The FDA added a black box warning in 2007 and in 2010 imposed severe prescribing restrictions under a Risk Evaluation and Mitigation Strategy (REMS). The case became a benchmark for how long a population-level safety signal can persist undetected in spontaneous reporting data.

Rofecoxib (Vioxx) and cardiovascular risk. Rofecoxib was voluntarily withdrawn by Merck in September 2004 after the APPROVe clinical trial showed a doubling of cardiovascular event risk with long-term use. Retrospective analysis of FAERS and insurance claims data showed that a statistically meaningful disproportionality signal for myocardial infarction had been present in the spontaneous reporting database by 2001, three years before withdrawal. The FDA's own post-hoc review acknowledged that the signal detection process had not been applied systematically. Vioxx accelerated the FDA's development of its formal Sentinel System and reshaped the pharmacovigilance mandate in the 2007 FDA Amendments Act.

SSRIs and suicidality in adolescents. Beginning in 2003, FAERS data showed an elevated rate of suicidal ideation and self-harm reports in pediatric patients prescribed selective serotonin reuptake inhibitors (SSRIs). The signal was complicated by the fact that depression itself elevates suicide risk, making disproportionality analysis difficult to interpret without indication data. After reviewing clinical trial data submitted by manufacturers, the FDA issued a black box warning in 2004 for pediatric patients and in 2007 extended it to young adults aged 18 to 24. The case is a canonical example of how FAERS signals must be interpreted alongside clinical trial data and indication information from the INDI table, because the reporting population for a psychiatric drug differs fundamentally from the reporting population for a diabetes drug.

The PRIMARYID versus CASEID deduplication problem

The most common analytical error in FAERS bulk data analysis is failing to deduplicate on caseid before counting cases. The issue arises from the follow-up report mechanism. When a case is initially submitted and then updated (because the patient's outcome changed, because the manufacturer received additional information, or because a correction was required), a new primaryid is generated. The original and all follow-up reports share the same caseid. Without deduplication, every follow-up appears as a separate case, inflating case counts and distorting disproportionality measures.

The FDA's recommended deduplication approach is to retain only the most recent (highest) primaryid per caseid. This requires loading the DEMO table across all quarters, sorting descending by primaryid within each caseid group, and keeping the first row. The deduplicated set of primaryid values is then used to filter all other tables before any join. In practice, deduplication across a multi-year FAERS corpus can reduce raw case counts by 15 to 25 percent, with the effect being largest in drugs that have been on the market longest and that have had safety communications prompting re-examination of historical reports.

A subtler issue is inter-quarter duplication. The same case may appear in multiple quarterly files if a follow-up report was submitted after the initial reporting quarter. The safest practice is to concatenate all DEMO tables across all quarters before deduplicating, rather than deduplicating within each quarter independently and then concatenating.
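The difference between the two orderings can be seen in a toy example. The rows below are invented, and the primaryid values simply mimic a caseid-plus-version pattern for readability; plain Python is used so the sketch runs without the bulk files.

```python
# Toy DEMO rows: (quarter, primaryid, caseid).
# Case 500 receives a follow-up in a later quarter.
demo_rows = [
    ("2023Q1", 5001, 500),
    ("2023Q1", 6001, 600),
    ("2023Q2", 5002, 500),  # follow-up to case 500
    ("2023Q2", 7001, 700),
]

def latest_per_case(rows):
    """Map each caseid to its highest primaryid."""
    latest = {}
    for _quarter, primaryid, caseid in rows:
        if caseid not in latest or primaryid > latest[caseid]:
            latest[caseid] = primaryid
    return latest

# Recommended order: pool all quarters, then deduplicate once.
pooled = latest_per_case(demo_rows)

# Flawed order: deduplicate inside each quarter, then concatenate.
per_quarter = []
for quarter in sorted({row[0] for row in demo_rows}):
    quarter_rows = [row for row in demo_rows if row[0] == quarter]
    per_quarter.extend(latest_per_case(quarter_rows).values())

print(len(pooled), len(per_quarter))
```

Pooling first leaves three cases, while per-quarter deduplication leaves four reports because both versions of case 500 survive.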

Python: downloading, deduplicating, and computing a PRR

The following script downloads quarterly FAERS bulk files, loads the DEMO, DRUG, and REAC tables, deduplicates on caseid by keeping the highest primaryid per case, and computes a Proportional Reporting Ratio for a target drug-reaction pair. The example uses rosiglitazone and myocardial infarction as the target pair.

import requests
import zipfile
import os
import pandas as pd
from itertools import product

# ---------------------------------------------------------------
# 1. Download quarterly FAERS bulk files from FDA
#    Files are named FAERS_ASCII_<YYYYQ>.zip, e.g. FAERS_ASCII_2023Q4.zip
# ---------------------------------------------------------------

BASE_URL = "https://fis.fda.gov/content/Exports/"

def faers_zip_name(year, quarter):
    # Covers FAERS-era quarters only (2012Q4 onward); legacy AERS archives
    # for earlier quarters use a different filename pattern.
    return "FAERS_ASCII_" + str(year) + "Q" + str(quarter) + ".zip"

def download_quarter(year, quarter, dest_dir="faers_raw"):
    os.makedirs(dest_dir, exist_ok=True)
    fname = faers_zip_name(year, quarter)
    url = BASE_URL + fname
    out_path = os.path.join(dest_dir, fname)
    if os.path.exists(out_path):
        print("Already have " + fname)
        return out_path
    print("Downloading " + url)
    r = requests.get(url, timeout=120, stream=True)
    r.raise_for_status()
    with open(out_path, "wb") as f:
        for chunk in r.iter_content(chunk_size=65536):
            f.write(chunk)
    print("Saved " + fname + " (" + str(os.path.getsize(out_path) // 1024 // 1024) + " MB)")
    return out_path

# Download 2022 and 2023 (8 quarters)
for year, q in product([2022, 2023], [1, 2, 3, 4]):
    try:
        download_quarter(year, q)
    except requests.HTTPError as e:
        print("Skipping " + str(year) + "Q" + str(q) + ": " + str(e))

# ---------------------------------------------------------------
# 2. Extract and load the DEMO and REAC files
#    Each ZIP contains ASCII "$"-delimited .txt files, one per table.
#    The seven tables: DEMO, DRUG, REAC, OUTC, RPSR, THER, INDI
# ---------------------------------------------------------------

def load_table_from_zip(zip_path, table_name):
    """Extract a specific table file from a quarterly ZIP and return a DataFrame."""
    with zipfile.ZipFile(zip_path, "r") as zf:
        # The inner files use a pattern like ASCII/DEMO22Q4.txt. Match on the
        # base name, case-insensitively, so README or PDF entries are ignored.
        matches = [n for n in zf.namelist()
                   if os.path.basename(n).upper().startswith(table_name.upper())
                   and n.lower().endswith(".txt")]
        if not matches:
            raise FileNotFoundError("No " + table_name + " file in " + zip_path)
        fname = matches[0]
        with zf.open(fname) as f:
            df = pd.read_csv(f, sep="$", encoding="latin-1", low_memory=False,
                             dtype=str)
    # Normalise header case defensively so later column lookups
    # (primaryid, caseid, pt, drugname) do not depend on it.
    df.columns = df.columns.str.strip().str.lower()
    return df

raw_dir = "faers_raw"
demo_frames = []
reac_frames = []

for zip_file in sorted(os.listdir(raw_dir)):
    if not zip_file.endswith(".zip"):
        continue
    zp = os.path.join(raw_dir, zip_file)
    try:
        demo_frames.append(load_table_from_zip(zp, "DEMO"))
        reac_frames.append(load_table_from_zip(zp, "REAC"))
    except Exception as e:
        print("Error in " + zip_file + ": " + str(e))

demo = pd.concat(demo_frames, ignore_index=True)
reac = pd.concat(reac_frames, ignore_index=True)
print("DEMO rows:", len(demo), "  REAC rows:", len(reac))

# ---------------------------------------------------------------
# 3. Deduplicate on CASEID
#    Each case can be followed-up, producing multiple reports with
#    the same caseid but different primaryids. The highest (most recent)
#    primaryid per caseid is the canonical version.
# ---------------------------------------------------------------

demo["primaryid"] = pd.to_numeric(demo["primaryid"], errors="coerce")
demo["caseid"]    = pd.to_numeric(demo["caseid"],    errors="coerce")

# Keep only the most recent follow-up per case. Rows whose identifiers
# failed to parse cannot be linked to other tables, so drop them first;
# otherwise all missing caseids would collapse into a single "case".
demo_dedup = (
    demo
    .dropna(subset=["primaryid", "caseid"])
    .sort_values("primaryid", ascending=False)
    .drop_duplicates(subset=["caseid"], keep="first")
    .copy()
)
print("After dedup:", len(demo_dedup), "unique cases")

# Keep only reactions linked to canonical primaryids
canonical_ids = set(demo_dedup["primaryid"].astype(int))
reac_dedup = reac[pd.to_numeric(reac["primaryid"], errors="coerce")
                  .isin(canonical_ids)].copy()

# ---------------------------------------------------------------
# 4. Proportional Reporting Ratio (PRR) for a target drug
#    PRR = (a / (a+b)) / (c / (c+d))
#    where a = reports of drug X with reaction Y
#          b = reports of drug X without reaction Y
#          c = reports of other drugs with reaction Y
#          d = reports of other drugs without reaction Y
#    PRR >= 2 with N >= 3 is a common regulatory threshold.
# ---------------------------------------------------------------

# Load DRUG table, then restrict it to the canonical primaryids from step 3
drug_frames = []
for zip_file in sorted(os.listdir(raw_dir)):
    if not zip_file.endswith(".zip"):
        continue
    zp = os.path.join(raw_dir, zip_file)
    try:
        drug_frames.append(load_table_from_zip(zp, "DRUG"))
    except Exception as e:
        print("DRUG error in " + zip_file + ": " + str(e))

drug = pd.concat(drug_frames, ignore_index=True)
drug["primaryid"] = pd.to_numeric(drug["primaryid"], errors="coerce")
drug_dedup = drug[drug["primaryid"].isin(canonical_ids)].copy()

# Normalise reported drug names: upper-case, trimmed, single-spaced.
# This does not map brand names or salt forms to a generic name.
drug_dedup["drugname_clean"] = (
    drug_dedup["drugname"]
    .str.upper()
    .str.strip()
    .str.replace(r"\s+", " ", regex=True)
)

TARGET_DRUG    = "ROSIGLITAZONE"
TARGET_REACTION = "MYOCARDIAL INFARCTION"  # MedDRA Preferred Term, upper-case

# Cases reporting target drug. This is an exact match on the cleaned name,
# so brand-name entries such as AVANDIA are not captured.
target_cases = set(
    drug_dedup[drug_dedup["drugname_clean"] == TARGET_DRUG]["primaryid"]
)

reac_dedup["pt_upper"] = reac_dedup["pt"].str.upper().str.strip()
reac_dedup["primaryid_int"] = pd.to_numeric(reac_dedup["primaryid"], errors="coerce")

# Count distinct reports, not REAC rows: a report listing several
# reactions must fall into exactly one cell of the 2x2 table.
all_reports = set(reac_dedup["primaryid_int"].dropna())
reaction_reports = set(
    reac_dedup.loc[reac_dedup["pt_upper"] == TARGET_REACTION,
                   "primaryid_int"].dropna()
)
target_reports = target_cases & all_reports

a = len(target_reports & reaction_reports)   # drug X with reaction Y
b = len(target_reports - reaction_reports)   # drug X without reaction Y
c = len(reaction_reports - target_reports)   # other drugs with reaction Y
d = len(all_reports) - a - b - c             # other drugs without reaction Y

if (a + b) > 0 and (c + d) > 0 and c > 0:
    prr = (a / (a + b)) / (c / (c + d))
    print("a=" + str(a) + " b=" + str(b) + " c=" + str(c) + " d=" + str(d))
    print("PRR for " + TARGET_DRUG + " / " + TARGET_REACTION + ": " + str(round(prr, 2)))
    if prr >= 2 and a >= 3:
        print("SIGNAL DETECTED (PRR >= 2, N >= 3)")
    else:
        print("No signal at standard threshold")
else:
    print("Insufficient data for PRR calculation")

Accessing FAERS for specific research use cases

For signal surveillance across many drug-reaction pairs simultaneously, the OpenFDA count endpoint is faster than bulk processing. For case-level analysis requiring patient demographics, indication data, or time-to-onset calculations, the quarterly bulk files are necessary. For longitudinal analysis spanning more than a decade, analysts should obtain the legacy AERS data (2004Q1 through 2012Q3) and align it with the FAERS schema, noting that the AERS file format differs slightly.

The FDA does not expose FAERS through a standard SPARQL or SQL interface. Analysts working at scale typically ingest the quarterly ZIPs into a relational database (PostgreSQL or DuckDB work well given the file sizes) and run disproportionality analyses as SQL queries rather than in pandas. DuckDB in particular reads delimited text files natively, including the dollar-delimited FAERS tables, and can execute a multi-year PRR calculation across hundreds of drug-reaction pairs in under a minute on a modern laptop.
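To illustrate what a PRR expressed as SQL might look like, the sketch below uses Python's built-in sqlite3 module as a stand-in for PostgreSQL or DuckDB. The table contents are invented, the drug and event names are placeholders, and the named-parameter syntax is sqlite3's own; the query itself is ordinary SQL that flags each report once before summing the four cells.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE drug (primaryid INTEGER, drugname TEXT);
CREATE TABLE reac (primaryid INTEGER, pt TEXT);
INSERT INTO drug VALUES
  (1,'DRUG X'),(2,'DRUG X'),(3,'DRUG X'),
  (4,'DRUG Y'),(5,'DRUG Y'),(6,'DRUG Y'),(7,'DRUG Y'),(8,'DRUG Y');
INSERT INTO reac VALUES
  (1,'EVENT Z'),(1,'NAUSEA'),(2,'EVENT Z'),(3,'NAUSEA'),
  (4,'EVENT Z'),(5,'NAUSEA'),(6,'NAUSEA'),(7,'NAUSEA'),(8,'NAUSEA');
""")

# One row per report with two 0/1 flags, then sum the 2x2 cells.
PRR_SQL = """
WITH flags AS (
    SELECT r.primaryid,
           MAX(CASE WHEN d.primaryid IS NOT NULL THEN 1 ELSE 0 END) AS has_drug,
           MAX(CASE WHEN r.pt = :pt THEN 1 ELSE 0 END) AS has_event
    FROM reac r
    LEFT JOIN drug d
           ON d.primaryid = r.primaryid AND d.drugname = :drug
    GROUP BY r.primaryid
)
SELECT SUM(has_drug * has_event),
       SUM(has_drug * (1 - has_event)),
       SUM((1 - has_drug) * has_event),
       SUM((1 - has_drug) * (1 - has_event))
FROM flags
"""

a, b, c, d = con.execute(PRR_SQL, {"drug": "DRUG X", "pt": "EVENT Z"}).fetchone()
prr = (a / (a + b)) / (c / (c + d))

print(a, b, c, d, round(prr, 2))
```

Grouping by primaryid before summing keeps report 1, which lists two reactions, from being counted in more than one cell.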

The FDA's FAERS data use agreement requires that researchers not attempt to re-identify patients from the published data. The demographic fields are sparse enough that re-identification from FAERS alone is generally infeasible, but combining FAERS with external databases containing patient-level prescription records requires careful privacy analysis under HIPAA and relevant IRB protocols.

Limitations every analyst must know

FAERS is a signal generation system, not an epidemiological database. The core limitations are structural and cannot be overcome by better analysis of the existing data.

Massive underreporting. The FDA estimates that fewer than one percent of adverse events in clinical practice are reported. High-profile safety communications, media coverage, and litigation dramatically increase reporting for the affected drug, creating the illusion of a safety signal emerging precisely when public attention peaks. This “notoriety bias” can overwhelm genuine pharmacovigilance signals in the PRR calculation.

Manufacturer bias in content. Approximately 90 percent of FAERS submissions are from drug manufacturers, whose reports are shaped by legal and regulatory considerations. Manufacturer reports tend to be more complete in some fields (drug lot number, route of administration) and less complete in others (indication, concomitant medications that might explain the event). The manufacturer has an incentive to submit reports that fulfil the 15-day requirement while framing the event in ways that minimize attributability to the drug.

No denominator for prescription volume. Without knowing how many patients took a drug, a report count is not an incidence estimate. A drug prescribed to ten million patients with 500 serious adverse event reports may be far safer than a drug prescribed to fifty thousand patients with 30 reports. Any analysis that compares absolute FAERS counts across drugs with different market sizes is methodologically flawed. Denominator linkage to prescription claims data is the correct approach for comparative safety research.

Confounding by indication. Patients prescribed drug X are not a random sample of the population. They have the disease or condition for which drug X is indicated, and that condition may itself be associated with the adverse outcome under investigation. The INDI table partially addresses this by recording the reported indication, but the field is sparsely populated and self-reported.

Despite these limitations, FAERS remains the most comprehensive publicly available source for post-market drug safety intelligence in the United States. For a drug that has been on the market for five or more years, FAERS often contains thousands of detailed case reports that no clinical trial ever captured. Used carefully, with explicit attention to reporting biases, systematic deduplication, and disproportionality rather than absolute counting, it is a powerful tool for drug safety research, regulatory journalism, and comparative effectiveness work.

