Technical writing

The Drug Lifecycle in Federal Data: From Approval to the Pharmacy Counter

· 9 min read· AI Analytics
FDACMSHealthcareOpen DataData Engineering

A prescription drug leaves a paper trail across the federal government that almost nobody reads end to end. The FDA approves it. A labeler registers every package of it. A manufacturer pays the doctors who will prescribe it. Medicare reports who prescribed it and what it cost. And the CDC counts the deaths when it goes wrong. Each of these is a separate public dataset; joined, they are the life story of a molecule. This is how to assemble that pipeline, and where the joins break.

The six datasets

The join spine: NDC, ingredient, manufacturer

There is no single key that runs end to end — the pipeline is held together by three different joins. The NDC code links the FDA NDC directory to Medicare Part D spending (both are NDC-keyed). The active ingredient links FDA approvals, prescribing, and overdose data, which speak in drug names rather than package codes. And the manufacturer / labeler name links Open Payments (which records the paying company) to the products that company makes. Get those three right and the whole life of a drug resolves; get them wrong and you double-count or lose the thread.

# The join spine: the National Drug Code (NDC) plus the active ingredient.
# FDA NDC directory -> Part D spending share an 11-digit NDC; Open Payments and
# overdose data join on the ingredient / drug name, not the NDC.

import requests

# 1. FDA: when was the drug approved, and by whom? (Drugs@FDA / openFDA)
appr = requests.get(
    "https://api.fda.gov/drug/drugsfda.json",
    params={"search": 'openfda.generic_name:"oxycodone"', "limit": 5}, timeout=30,
).json()

# 2. NDC directory: every marketed product + labeler for that ingredient.
ndc = requests.get(
    "https://api.fda.gov/drug/ndc.json",
    params={"search": 'active_ingredients.name:"oxycodone"', "limit": 5}, timeout=30,
).json()

# 3-5. Open Payments (manufacturer -> prescriber $), Part D (prescriber + spend),
#      CDC overdose mortality (downstream harm) join on manufacturer + ingredient.
# Normalize NDC to 11 digits (5-4-2) before joining FDA NDC to Part D.
def ndc11(code):  # CMS uses 11-digit; FDA prints 10-digit in 4-4-2/5-3-2/5-4-1
    a, b, c = code.split("-")
    return f"{a:0>5}{b:0>4}{c:0>2}"

The gotchas that break the join

  • NDC format drift. The FDA prints 10-digit NDCs in three segment layouts (4-4-2, 5-3-2, 5-4-1); CMS uses a zero-padded 11-digit form. Join without normalizing and most rows miss.
  • Brand vs generic. One ingredient maps to many brands and many labelers; spending and harm aggregate at the ingredient level while payments and approvals live at the product/sponsor level.
  • Manufacturer name normalization. The paying entity in Open Payments, the labeler in the NDC directory, and the sponsor in Drugs@FDA are often the same company under three spellings and two subsidiaries.
  • Different denominators. Part D is Medicare only; Open Payments covers all payers; overdose mortality is population-wide. The pipeline shows correlation across stages, not a closed accounting.

What the assembled pipeline answers

Joined, the six datasets answer questions no single one can: which newly approved drugs saw the largest manufacturer-payment campaigns to prescribers; whether payments to a physician track that physician's prescribing and the program's spending; and, for controlled substances, how the approval-to-marketing-to-prescribing chain sits against the overdose curve. None of it requires a special key or a paid feed — every stage is a public federal dataset, joinable by anyone willing to normalize an NDC.


Related writing: The Opioid Epidemic in Federal Data — the same pipeline focused on one drug class, joining DEA ARCOS distribution, prescribing, treatment, and mortality.

See also: CMS Open Payments — the Sunshine Act dataset behind the marketing-payments stage of this pipeline.