A prescription drug leaves a paper trail across the federal government that almost nobody reads end to end. The FDA approves it. A labeler registers every package of it. A manufacturer pays the doctors who will prescribe it. Medicare reports who prescribed it and what it cost. And the CDC counts the deaths when it goes wrong. Each of these is a separate public dataset; joined, they are the life story of a molecule. This is how to assemble that pipeline, and where the joins break.
The six datasets
- FDA approvals (Drugs@FDA) — the moment a drug enters the market: application number, sponsor, approval date, and product details.
- The National Drug Code directory — every marketed product and its labeler, keyed by the NDC. This is the spine the rest hang on.
- CMS Open Payments — the Sunshine Act record of payments from drug and device makers to physicians and teaching hospitals.
- Medicare Part D prescribers and Part D drug spending — who prescribes it, and what the program pays.
- CDC overdose mortality — for the relevant drug classes, the downstream harm, by substance and geography.
The join spine: NDC, ingredient, manufacturer
There is no single key that runs end to end — the pipeline is held together by three different joins. The NDC code links the FDA NDC directory to Medicare Part D spending (both are NDC-keyed). The active ingredient links FDA approvals, prescribing, and overdose data, which speak in drug names rather than package codes. And the manufacturer / labeler name links Open Payments (which records the paying company) to the products that company makes. Get those three right and the whole life of a drug resolves; get them wrong and you double-count or lose the thread.
# The join spine: the National Drug Code (NDC) plus the active ingredient.
# FDA NDC directory -> Part D spending share an 11-digit NDC; Open Payments and
# overdose data join on the ingredient / drug name, not the NDC.
import requests
# 1. FDA: when was the drug approved, and by whom? (Drugs@FDA / openFDA)
appr = requests.get(
"https://api.fda.gov/drug/drugsfda.json",
params={"search": 'openfda.generic_name:"oxycodone"', "limit": 5}, timeout=30,
).json()
# 2. NDC directory: every marketed product + labeler for that ingredient.
ndc = requests.get(
"https://api.fda.gov/drug/ndc.json",
params={"search": 'active_ingredients.name:"oxycodone"', "limit": 5}, timeout=30,
).json()
# 3-5. Open Payments (manufacturer -> prescriber $), Part D (prescriber + spend),
# CDC overdose mortality (downstream harm) join on manufacturer + ingredient.
# Normalize NDC to 11 digits (5-4-2) before joining FDA NDC to Part D.
def ndc11(code): # CMS uses 11-digit; FDA prints 10-digit in 4-4-2/5-3-2/5-4-1
a, b, c = code.split("-")
return f"{a:0>5}{b:0>4}{c:0>2}"The gotchas that break the join
- NDC format drift. The FDA prints 10-digit NDCs in three segment layouts (4-4-2, 5-3-2, 5-4-1); CMS uses a zero-padded 11-digit form. Join without normalizing and most rows miss.
- Brand vs generic. One ingredient maps to many brands and many labelers; spending and harm aggregate at the ingredient level while payments and approvals live at the product/sponsor level.
- Manufacturer name normalization. The paying entity in Open Payments, the labeler in the NDC directory, and the sponsor in Drugs@FDA are often the same company under three spellings and two subsidiaries.
- Different denominators. Part D is Medicare only; Open Payments covers all payers; overdose mortality is population-wide. The pipeline shows correlation across stages, not a closed accounting.
What the assembled pipeline answers
Joined, the six datasets answer questions no single one can: which newly approved drugs saw the largest manufacturer-payment campaigns to prescribers; whether payments to a physician track that physician's prescribing and the program's spending; and, for controlled substances, how the approval-to-marketing-to-prescribing chain sits against the overdose curve. None of it requires a special key or a paid feed — every stage is a public federal dataset, joinable by anyone willing to normalize an NDC.
Related writing: The Opioid Epidemic in Federal Data — the same pipeline focused on one drug class, joining DEA ARCOS distribution, prescribing, treatment, and mortality.
See also: CMS Open Payments — the Sunshine Act dataset behind the marketing-payments stage of this pipeline.