ClinicalTrials.gov: The Federal Database Behind Every US Drug and Device Trial


ClinicalTrials.gov is the federally mandated public registry for clinical research conducted in the United States or funded by US agencies. Since 2007, federal law has required sponsors to register applicable trials of drugs, biologics, and devices no later than 21 days after the first participant is enrolled. The database now holds more than 500,000 study records from 221 countries. It is the only systematic denominator for clinical research: the one place where a trial must appear whether its results are positive, negative, or never published.

This article covers the legal framework that created mandatory registration and results reporting, the data schema of a ClinicalTrials.gov record, and the persistent compliance failure that has left more than half of completed trials without timely results. It then turns to the enforcement gap that has allowed non-compliance to persist for nearly two decades, the two primary access paths (the API v2 and the AACT PostgreSQL database from Duke University), and the publication bias problem that trial registration was designed to address. It closes with how medical journalists and researchers use the registry to investigate sponsor behavior, trial design manipulation, and off-label drug marketing, followed by a Python workflow for pulling trials for a drug class and computing the lag between primary completion and results submission.

The legal framework: FDAAA 801 and the 2016 Final Rule

Section 801 of the Food and Drug Administration Amendments Act of 2007 (FDAAA 801) created the mandatory registration requirement. Before 2007, ClinicalTrials.gov existed but registration was voluntary except for trials of serious or life-threatening conditions seeking FDA approval. FDAAA 801 changed this: applicable clinical trials of drugs, biologics, and devices regulated by the FDA must register at ClinicalTrials.gov no later than 21 days after enrolling the first participant. “Applicable clinical trial” covers Phase 2, 3, and 4 interventional trials of FDA-regulated products with at least one US site or conducted under an IND or IDE. Phase 1 trials are generally exempt unless they also assess efficacy.

FDAAA 801 also imposed a results reporting requirement: sponsors must submit summary results to ClinicalTrials.gov within 12 months of the primary completion date, defined as the date the last participant was examined or received an intervention for the purposes of collecting data on the primary outcome measure. The 12-month clock runs from the primary completion date, not from publication and not from FDA approval. Congress gave HHS authority to levy civil monetary penalties of up to $10,000 per day for failure to register or to report results on time.
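As a rough illustration of how the 12-month clock works, the sketch below computes a results due date from a primary completion date and the theoretical maximum penalty for a given number of days overdue. The function names are hypothetical, the $10,000 figure is simply the statutory amount cited above, and the calculation ignores extensions, certifications of delay, and any notice period that would precede an actual penalty.

```python
import calendar
from datetime import date

# Statutory figure cited in the text; used here purely for illustration.
DAILY_PENALTY_USD = 10_000


def results_due_date(primary_completion: date) -> date:
    """Return the date 12 calendar months after primary completion."""
    year = primary_completion.year + 1
    # Clamp Feb 29 to Feb 28 when the following year is not a leap year.
    last_day = calendar.monthrange(year, primary_completion.month)[1]
    day = min(primary_completion.day, last_day)
    return date(year, primary_completion.month, day)


def days_overdue(primary_completion: date, as_of: date) -> int:
    """Days past the 12-month deadline, or 0 if the deadline has not passed."""
    return max(0, (as_of - results_due_date(primary_completion)).days)


def max_theoretical_penalty(primary_completion: date, as_of: date) -> int:
    """Upper bound on accrued penalties, assuming the full daily amount."""
    return days_overdue(primary_completion, as_of) * DAILY_PENALTY_USD


pcd = date(2022, 3, 15)
print(results_due_date(pcd))                             # 2023-03-15
print(days_overdue(pcd, date(2023, 4, 14)))              # 30
print(max_theoretical_penalty(pcd, date(2023, 4, 14)))   # 300000
```

The calendar-month arithmetic is a design choice: adding a flat 365 days would drift by a day across leap years, which matters when flagging trials that are only just overdue.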

The 2016 Final Rule issued by HHS and NIH expanded results reporting to include trials of unapproved products—closing a major loophole through which sponsors had claimed exemption for trials not yet seeking FDA approval—and added requirements for adverse event reporting in the results section. The rule also created a structured format for primary and secondary outcome measure results. Trials initiated after January 18, 2017, are subject to the full 2016 rule. Pre-2017 trials completed after the effective date are also covered.

What a ClinicalTrials.gov record contains

Every registered study receives a unique NCT number—an identifier of the form NCT followed by eight digits (e.g., NCT04280705). The NCT number is the canonical citation key for a trial in regulatory submissions, journal articles, FDA label cross-references, and news coverage. It is stable across the life of the study.
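Because the NCT number has a fixed shape, it is easy to validate identifiers or pull them out of free text such as a journal abstract. The following is a minimal sketch; the helper names are hypothetical, and NCT01234567 in the sample text is a made-up identifier used only to show de-duplication.

```python
import re
from typing import List

# "NCT" followed by exactly eight digits, as described in the text.
NCT_PATTERN = re.compile(r"\bNCT\d{8}\b")


def is_valid_nct(identifier: str) -> bool:
    """True if the whole string is a well-formed NCT number."""
    return re.fullmatch(r"NCT\d{8}", identifier) is not None


def extract_nct_ids(text: str) -> List[str]:
    """Return NCT numbers found in text, de-duplicated, in order of appearance."""
    seen: List[str] = []
    for match in NCT_PATTERN.findall(text):
        if match not in seen:
            seen.append(match)
    return seen


sample = "Results of NCT04280705 were compared with NCT01234567 (see also NCT04280705)."
print(is_valid_nct("NCT04280705"))   # True
print(is_valid_nct("NCT0428070"))    # False (only seven digits)
print(extract_nct_ids(sample))       # ['NCT04280705', 'NCT01234567']
```

A well-formed identifier is not proof that the record exists; confirming that requires a lookup against the registry.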

The registration record contains several distinct modules. The identification module holds the NCT number, brief and official title, acronym if any, the organization that registered the study, and the secondary identifiers used by the sponsor or funding agency (IND numbers, grant numbers, EudraCT numbers for European trials). The status module records overall status, which cycles through values including Not yet recruiting, Recruiting, Enrolling by invitation, Active not recruiting, Completed, Terminated, Withdrawn, and Suspended. The distinction between Terminated (study stopped before completion) and Withdrawn (study never started enrolling) matters for bias analyses: terminated trials are more likely to have interim results that influenced the stop decision.

The design module specifies study type (Interventional or Observational), phase for drug trials (Phase 1, Phase 2, Phase 3, Phase 4, or combinations like Phase 1/Phase 2), allocation (randomized or non-randomized), masking (open, single-blind, double-blind, triple-blind, quadruple-blind), primary purpose (treatment, prevention, diagnostic, supportive care, screening, health services research, basic science, device feasibility), and enrollment target. Phase 1 trials test safety and dosing in small populations, typically 20 to 80 participants. Phase 2 trials assess efficacy and side effects in larger groups. Phase 3 trials are large randomized controlled trials comparing the intervention to placebo or standard of care—the pivotal evidence base for FDA approval. Phase 4 trials are post-approval studies examining long-term effects, new populations, or new indications.

The conditions module lists the diseases or conditions being studied, drawn from MeSH (Medical Subject Headings) or free text. The interventions module lists each experimental and control intervention with its type (drug, biologic, device, procedure, radiation, behavioral, dietary supplement, genetic, combination product, diagnostic test, or other), name, and description. Drug interventions include the investigational name and, once the product is approved, the generic and brand names.

The outcomes module specifies primary outcome measures—the endpoints whose results will determine success or failure of the trial—and secondary outcome measures. Each outcome includes a description, time frame, and measure type. Pre-specifying outcomes before data collection is the methodological reason trial registration exists: a sponsor who registers outcomes before seeing data cannot selectively emphasize favorable endpoints after the fact.

The eligibility module defines inclusion and exclusion criteria in free text, sex, minimum and maximum age, and whether healthy volunteers are accepted. The contacts and locations module lists the facilities where the trial is conducted, each with country, city, and contact information for the facility investigator. The sponsor and collaborators module identifies the lead sponsor (the entity responsible for initiating and managing the study) and any collaborators (other organizations contributing resources or oversight).

If results have been submitted, the record gains a results section with structured tables for each primary and secondary outcome measure (including the actual measured values, statistical test results, and confidence intervals), an adverse events summary (serious and non-serious events by system organ class and preferred term), and links to publications in PubMed. The results section is the functional equivalent of a journal article abstract for trials that may never be formally published.

The results reporting compliance failure

The 12-month results reporting requirement has been systematically violated since it took effect. Multiple independent analyses using ClinicalTrials.gov data have found that more than 50 percent of completed applicable clinical trials have not reported results within the legally required window—and in many cases have never reported results at all.

A 2015 study in the New England Journal of Medicine examined 13,327 trials registered between 2007 and 2010 with a primary completion date before 2013; only 13.4 percent had results in ClinicalTrials.gov within 12 months of primary completion. A 2019 study in Science tracking 4,209 trials funded by industry, NIH, and other government agencies found that 36 percent of trials had not reported results two years after primary completion. The compliance rate for industry-sponsored trials was modestly higher than for academic or NIH-funded trials, a counterintuitive finding that likely reflects the regulatory pressure on pharmaceutical companies preparing FDA submissions. A 2023 analysis of NIH-funded trials found that 39 percent had not reported results within 24 months.

The compliance gap is not uniform across trial types. Phase 4 post-marketing trials have among the worst reporting rates. Trials that terminate early are substantially less likely to report results than trials that complete on schedule—precisely the trials where interim data is most clinically relevant. Trials in oncology have higher reporting rates than trials in psychiatry or behavioral medicine.

The downstream consequence is that the medical literature systematically overestimates the efficacy of treatments. A trial with a negative primary outcome is far less likely to be published in a peer-reviewed journal than a trial with a positive outcome, and if it is neither published nor reported to ClinicalTrials.gov, the negative data simply disappears from the public record. Systematic reviewers and meta-analysts who rely on the published literature are working with a biased sample. ClinicalTrials.gov is the only mechanism that creates a complete denominator: every trial that registered must be accounted for, even if the results are unfavorable.

The FDAAA enforcement gap

Congress authorized HHS to levy civil monetary penalties of up to $10,000 per day for failure to register or report results on time. Sustained non-compliance on a single trial could theoretically result in penalties of roughly $3.65 million per year ($10,000 × 365 days). NIH-funded investigators who fail to comply risk having grant funding withheld.

In practice, HHS has almost never used this authority. As of the mid-2020s, no pharmaceutical company or academic institution has been fined under FDAAA 801 for results non-reporting. The FDA, which carries out enforcement on behalf of HHS, has sent pre-notices and notices of noncompliance to a small number of sponsors, but enforcement has not escalated to penalties. The 2016 Final Rule added a mechanism for NIH to withhold funding from non-compliant investigators, and NIH has used this mechanism in a small number of cases, but the threat has not produced systematic compliance.

The compliance tracking tool at ClinicalTrials.gov itself—accessible through the Expert Search interface—identifies trials that are overdue. The International Committee of Medical Journal Editors (ICMJE) requires trial registration as a condition of publication in member journals, which provides some independent enforcement pressure. A handful of investigative journalists and research integrity organizations, including the AllTrials campaign, have used ClinicalTrials.gov data to publish lists of non-compliant sponsors. These reputational mechanisms have had more visible effect than HHS enforcement.

Accessing the data: API v2 and the AACT database

ClinicalTrials.gov provides two primary programmatic access paths.

The ClinicalTrials.gov API v2 at clinicaltrials.gov/api/v2/studies is a RESTful JSON API that replaced the older v1 endpoint. It supports querying by condition, intervention, sponsor, NCT number, status, phase, date ranges, and free-text terms. The API supports cursor-based pagination via the pageToken parameter and returns structured JSON with the full study record in nested modules. The fields parameter controls which fields are included in the response, allowing lightweight bulk queries that return only the fields needed without downloading the full record for each study. Rate limiting is applied, so bulk jobs should pace their requests; the public endpoint does not require an API key.
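The interplay between the fields parameter and cursor-based pagination can be sketched without touching the network. In the example below, build_params and iter_studies are hypothetical helpers, and fake_fetch is a stand-in for an HTTP call (a real version would issue a GET against the studies endpoint and return the parsed JSON). The parameter and key names (query.intr, fields, pageSize, pageToken, nextPageToken) are the ones used in the full script later in this article.

```python
from typing import Callable, Dict, Iterator, List, Optional


def build_params(intervention: str, fields: List[str],
                 page_size: int = 100,
                 page_token: Optional[str] = None) -> Dict[str, str]:
    """Assemble query parameters for one page of results."""
    params = {
        "query.intr": intervention,
        "fields": ",".join(fields),   # request only the fields needed
        "pageSize": str(page_size),
    }
    if page_token:
        params["pageToken"] = page_token
    return params


def iter_studies(fetch_page: Callable[[Dict[str, str]], dict],
                 intervention: str, fields: List[str]) -> Iterator[dict]:
    """Yield studies page by page until no nextPageToken is returned."""
    token = None
    while True:
        page = fetch_page(build_params(intervention, fields, page_token=token))
        yield from page.get("studies", [])
        token = page.get("nextPageToken")
        if not token:
            break


def fake_fetch(params: Dict[str, str]) -> dict:
    """Offline stand-in for the HTTP request: two pages, then stop."""
    if "pageToken" not in params:
        return {"studies": [{"id": 1}, {"id": 2}], "nextPageToken": "abc"}
    return {"studies": [{"id": 3}]}


print([s["id"] for s in iter_studies(fake_fetch, "semaglutide", ["NCTId"])])
# [1, 2, 3]
```

Passing the fetch function in as an argument keeps the pagination logic testable and makes it straightforward to add throttling or retries in one place.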

The AACT database (Aggregate Analysis of ClinicalTrials.gov) is maintained by the Clinical Trials Transformation Initiative (CTTI) at Duke University and is available at aact.ctti-clinicaltrials.org. AACT transforms the ClinicalTrials.gov XML export into a well-structured PostgreSQL relational database updated daily. The schema normalizes the nested API response into flat tables: studies, eligibilities, conditions, interventions, outcomes, outcome_measurements, reported_events, reported_event_totals, result_contacts, sponsors, facilities, and approximately 40 other tables. Full database dumps in PostgreSQL format are available for download; a public read-only PostgreSQL server at aact-db.ctti-clinicaltrials.org port 5432 is also available for direct SQL queries.

For analysts familiar with SQL, AACT is substantially easier to work with than the API for complex analytical questions. A query joining studies, sponsors, outcomes, and reported_events to compute per-sponsor non-compliance rates is straightforward in PostgreSQL and would require many paginated API calls and post-processing in Python. AACT is the preferred access path for academic research and systematic reviews.
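To give a sense of what such a query might look like, here is a hedged sketch that counts completed trials without submitted results per lead sponsor. The column names (overall_status, primary_completion_date, results_first_submitted_date, lead_or_collaborator) are assumptions based on the AACT schema and should be checked against the current AACT data dictionary. The function expects an already open DB-API connection, for example one created with a PostgreSQL driver such as psycopg2, which uses the %(name)s parameter style shown here.

```python
# Assumed AACT table and column names; verify against the data dictionary.
OVERDUE_BY_SPONSOR_SQL = """
SELECT sp.name AS lead_sponsor,
       COUNT(*) AS completed_trials,
       SUM(CASE WHEN s.results_first_submitted_date IS NULL
                THEN 1 ELSE 0 END) AS without_results
FROM studies s
JOIN sponsors sp ON sp.nct_id = s.nct_id
WHERE sp.lead_or_collaborator = 'lead'
  AND UPPER(s.overall_status) = 'COMPLETED'
  AND s.primary_completion_date < CURRENT_DATE - INTERVAL '12 months'
GROUP BY sp.name
HAVING COUNT(*) >= %(min_trials)s
ORDER BY without_results DESC
LIMIT 20;
"""


def run_overdue_query(conn, min_trials=20):
    """Run the query on an open DB-API connection and return all rows.

    Each row is (lead_sponsor, completed_trials, without_results).
    """
    with conn.cursor() as cur:
        cur.execute(OVERDUE_BY_SPONSOR_SQL, {"min_trials": min_trials})
        return cur.fetchall()
```

This counts every completed trial, not only applicable clinical trials under FDAAA 801, so the output is a screening list rather than a compliance determination.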

Publication bias: why registration matters

Publication bias is the tendency for positive trials—those showing that the treatment works—to be published faster and more often than negative or null trials. The effect has been documented across therapeutic areas for decades. A 2008 analysis in the New England Journal of Medicine examined 74 trials of antidepressants submitted to the FDA: 38 had positive outcomes and 37 of those were published; 36 had negative or questionable outcomes and only 14 were published, while 22 were not published at all. The published literature showed a 94 percent positive effect rate; the FDA's complete dataset showed 51 percent.
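The counts in the antidepressant example are enough to work out the publication rates directly. The short calculation below uses only the four numbers quoted above; the variable names are illustrative. The 94 percent figure cannot be recomputed from these counts alone, because it depends on how the published papers presented their findings.

```python
# Counts quoted in the text for the 74 antidepressant trials.
positive_total, positive_published = 38, 37
negative_total, negative_published = 36, 14

all_trials = positive_total + negative_total

# Share of trials in each group that reached the published literature.
pub_rate_positive = round(100 * positive_published / positive_total, 1)
pub_rate_negative = round(100 * negative_published / negative_total, 1)

# Share of all trials the FDA judged positive (the unbiased denominator).
fda_positive_share = round(100 * positive_total / all_trials, 1)

print(all_trials)           # 74
print(pub_rate_positive)    # 97.4
print(pub_rate_negative)    # 38.9
print(fda_positive_share)   # 51.4
```

A positive trial was roughly two and a half times as likely to be published as a negative or questionable one, which is the gap that a complete registry is meant to expose.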

Trial registration addresses publication bias by creating a complete pre-committed list of trials that must be accounted for. If every trial that starts is registered, and if results reporting requirements are followed, a systematic reviewer can identify every trial ever conducted on a treatment, including those whose negative results were never submitted to a journal. The registered trial record is the denominator.

Outcome switching is a related problem. A sponsor who pre-registers a trial with overall survival as the primary endpoint but later publishes using progression-free survival—a less stringent endpoint where the drug performed better—has engaged in outcome switching. Comparing the registered outcomes module against the published paper's reported outcomes is a direct investigative technique used by researchers and journalists. The COMPare project at the Centre for Evidence-Based Medicine published analyses of published trials in major journals, finding outcome switching in the majority of them. ClinicalTrials.gov is the document of record.
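A first-pass version of that comparison can be automated. The sketch below is a deliberately crude illustration with hypothetical function names and hand-typed outcome labels: it normalizes case and whitespace and then compares the registered and published primary outcomes as sets. Real comparisons need human review, since outcomes can differ in time frame or definition while sharing a label.

```python
from typing import Dict, List


def normalize(label: str) -> str:
    """Lowercase and collapse whitespace so trivial differences do not count."""
    return " ".join(label.lower().split())


def compare_outcomes(registered: List[str], published: List[str]) -> Dict[str, List[str]]:
    """Report outcomes dropped from, added to, or kept in the published paper."""
    reg = {normalize(x) for x in registered}
    pub = {normalize(x) for x in published}
    return {
        "dropped": sorted(reg - pub),   # registered but not reported as primary
        "added": sorted(pub - reg),     # reported as primary but never registered
        "kept": sorted(reg & pub),
    }


registered_primary = ["Overall survival"]
published_primary = ["Progression-free  survival"]

print(compare_outcomes(registered_primary, published_primary))
# {'dropped': ['overall survival'], 'added': ['progression-free survival'], 'kept': []}
```

Any trial with non-empty "dropped" and "added" lists is a candidate for manual checking against the registry's record history, not a confirmed case of outcome switching.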

How journalists and researchers use the registry

Medical journalists have developed a set of investigative techniques built directly on ClinicalTrials.gov data. Searching for all completed trials of a drug and counting what fraction have results reveals a sponsor's compliance posture and flags the trials whose results may have been withheld. Comparing the registered primary outcome to the outcome emphasized in the published paper identifies outcome switching. Examining the history of record changes (ClinicalTrials.gov keeps each version of the registration with timestamps) reveals whether endpoints were changed after the trial began enrolling but before results were known.

Researchers studying off-label drug use examine the indicated conditions field and the study population eligibility criteria to map the difference between the population studied in trials and the broader population treated in clinical practice. If a drug is approved based on trials that excluded patients over 75 or patients with renal impairment, ClinicalTrials.gov records show this precisely. The gap between the enrolled population and the marketed population is the analytical foundation for off-label safety investigations.

Sponsor behavior analysis uses the sponsor and collaborators fields combined with the status and results fields to characterize which companies and academic institutions are completing trials on time, which are terminating trials early and why, and which are failing to report results. A sponsor with a pattern of terminating trials that are later resumed under a new NCT number with slightly modified designs—a practice called “salami slicing”—is identifiable through ClinicalTrials.gov history. Similarly, a sponsor that registers dozens of trials for a drug and then reports results selectively from the positive ones is visible in the aggregate record.
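The aggregate view described here boils down to grouping trial records by lead sponsor and comparing completed trials against those with posted results. The following sketch uses only the standard library and a handful of invented records; the field names (sponsor, status, has_results) and the sponsor labels are hypothetical placeholders for whatever a real extract provides.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def results_rate_by_sponsor(records: List[Dict], min_trials: int = 1) -> List[Tuple]:
    """Return (sponsor, completed, with_results, pct_with_results), worst first."""
    counts = defaultdict(lambda: [0, 0])   # sponsor -> [completed, with_results]
    for rec in records:
        if rec["status"] != "COMPLETED":
            continue                        # only completed trials are counted
        counts[rec["sponsor"]][0] += 1
        if rec["has_results"]:
            counts[rec["sponsor"]][1] += 1
    rows = [
        (sponsor, done, reported, round(100 * reported / done, 1))
        for sponsor, (done, reported) in counts.items()
        if done >= min_trials
    ]
    return sorted(rows, key=lambda row: row[3])


# Invented records for illustration only.
records = [
    {"sponsor": "Sponsor A", "status": "COMPLETED", "has_results": True},
    {"sponsor": "Sponsor A", "status": "COMPLETED", "has_results": False},
    {"sponsor": "Sponsor B", "status": "COMPLETED", "has_results": True},
    {"sponsor": "Sponsor B", "status": "TERMINATED", "has_results": False},
]

print(results_rate_by_sponsor(records))
# [('Sponsor A', 2, 1, 50.0), ('Sponsor B', 1, 1, 100.0)]
```

The min_trials threshold matters in practice: a sponsor with one completed trial and no results shows a 0 percent rate, which says little about a pattern of behavior.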

Industry analysts use the pipeline view (all Recruiting and Active not recruiting Phase 2 and Phase 3 trials for a therapeutic area) to map the competitive landscape before public announcements. Because registration is required around the start of enrollment (within 21 days of the first participant under FDAAA 801), ClinicalTrials.gov often reveals a sponsor's pipeline before any press release. The enrollment count field, combined with the start date and estimated completion date, allows analysts to project when top-line data will be available.

Python: pulling trials for a drug class and computing reporting lag

The following script queries the ClinicalTrials.gov API v2 for trials of GLP-1 receptor agonists (semaglutide, liraglutide, tirzepatide), parses the relevant fields into a DataFrame, and computes the lag between primary completion and results submission for completed trials. It also identifies trials that are past the 12-month deadline without results.

import requests
import pandas as pd
from datetime import datetime, date

# ---------------------------------------------------------------
# 1. Query the ClinicalTrials.gov API v2 for trials of a drug class
#    API base: https://clinicaltrials.gov/api/v2/studies
#    Documentation: https://clinicaltrials.gov/data-api/api
# ---------------------------------------------------------------

BASE_URL = "https://clinicaltrials.gov/api/v2/studies"

def fetch_trials(query_term, max_studies=1000):
    """
    Fetch trials matching query_term.
    Returns a list of study dicts from the API.
    The API uses cursor-based pagination via nextPageToken.
    """
    all_studies = []
    params = {
        "query.intr": query_term,        # filter on intervention name
        "fields": (
            "NCTId,BriefTitle,OverallStatus,Phase,"
            "StartDate,PrimaryCompletionDate,"
            "ResultsFirstSubmitDate,StudyFirstSubmitDate,"
            "LeadSponsorName,Condition,EnrollmentCount,"
            "StudyType,HasResults"
        ),
        "pageSize": 100,
        "format": "json",
    }
    next_token = None

    while len(all_studies) < max_studies:
        if next_token:
            params["pageToken"] = next_token
        r = requests.get(BASE_URL, params=params, timeout=30)
        r.raise_for_status()
        data = r.json()

        studies = data.get("studies", [])
        if not studies:
            break
        all_studies.extend(studies)

        # A missing nextPageToken means this was the last page.
        next_token = data.get("nextPageToken")
        if not next_token:
            break

    # The final page can overshoot the cap, so trim to max_studies.
    all_studies = all_studies[:max_studies]
    print(f"Fetched {len(all_studies)} studies for: {query_term}")
    return all_studies

# ---------------------------------------------------------------
# 2. Parse relevant fields into a flat DataFrame
# ---------------------------------------------------------------

def parse_studies(studies):
    rows = []
    for s in studies:
        proto = s.get("protocolSection", {})
        id_mod = proto.get("identificationModule", {})
        status_mod = proto.get("statusModule", {})
        design_mod = proto.get("designModule", {})
        sponsor_mod = proto.get("sponsorCollaboratorsModule", {})
        cond_mod = proto.get("conditionsModule", {})

        nct_id = id_mod.get("nctId", "")
        title = id_mod.get("briefTitle", "")
        status = status_mod.get("overallStatus", "")
        start_date = status_mod.get("startDateStruct", {}).get("date", "")
        pcd = status_mod.get("primaryCompletionDateStruct", {}).get("date", "")
        first_submit = status_mod.get("studyFirstSubmitDate", "")
        results_submit = status_mod.get("resultsFirstSubmitDate", "")
        phase_list = design_mod.get("phases", [])
        phase = ", ".join(phase_list) if phase_list else ""
        study_type = design_mod.get("studyType", "")
        enrollment = design_mod.get("enrollmentInfo", {}).get("count", None)
        sponsor = sponsor_mod.get("leadSponsor", {}).get("name", "")
        conditions = "; ".join(cond_mod.get("conditions", []))
        has_results = s.get("hasResults", False)

        rows.append({
            "nct_id": nct_id,
            "title": title,
            "status": status,
            "phase": phase,
            "study_type": study_type,
            "sponsor": sponsor,
            "conditions": conditions,
            "enrollment": enrollment,
            "start_date": start_date,
            "primary_completion_date": pcd,
            "first_submit_date": first_submit,
            "results_submit_date": results_submit,
            "has_results": has_results,
        })
    return pd.DataFrame(rows)

# ---------------------------------------------------------------
# 3. Compute completion-to-results reporting lag
#    For each completed trial that HAS reported results,
#    compute days from primary completion date to results submission.
#    For completed trials WITHOUT results, flag those more than
#    12 months past primary completion as overdue.
# ---------------------------------------------------------------

def parse_date(s):
    if not s:
        return None
    for fmt in ("%Y-%m-%d", "%Y-%m", "%Y"):
        try:
            return datetime.strptime(s, fmt).date()
        except ValueError:
            continue
    return None

def compute_lag(df):
    df = df.copy()
    # parse_date returns datetime.date objects; convert to datetime64 so that
    # vectorized subtraction and the .dt accessor work reliably.
    df["pcd_parsed"] = pd.to_datetime(df["primary_completion_date"].apply(parse_date))
    df["results_parsed"] = pd.to_datetime(df["results_submit_date"].apply(parse_date))

    today = pd.Timestamp.today().normalize()
    one_year_days = 365

    # Completed trials: status is COMPLETED and PCD is in the past.
    # Not every completed trial is an "applicable clinical trial" under
    # FDAAA 801, so the rates below are a screening proxy, not a legal finding.
    completed = df[
        (df["status"] == "COMPLETED") &
        df["pcd_parsed"].notna() &
        (df["pcd_parsed"] < today)
    ].copy()

    print(f"Completed trials: {len(completed)}")

    # Trials with results submitted
    with_results = completed[completed["results_parsed"].notna()].copy()
    with_results["lag_days"] = (
        with_results["results_parsed"] - with_results["pcd_parsed"]
    ).dt.days

    print(f"With results reported: {len(with_results)}")
    if len(with_results) > 0:
        median_lag = with_results["lag_days"].median()
        over_12mo = int((with_results["lag_days"] > one_year_days).sum())
        pct_late = round(100 * over_12mo / len(with_results), 1)
        print(f"Median lag (days): {round(median_lag)}")
        print(f"Reported late (> 12 months after PCD): {over_12mo} ({pct_late}%)")

    # Trials without results that are past the 12-month deadline
    no_results = completed[completed["results_parsed"].isna()].copy()
    overdue = no_results[
        (today - no_results["pcd_parsed"]).dt.days > one_year_days
    ]
    print(f"Overdue (no results, > 12 months past PCD): {len(overdue)}")
    non_compliance_rate = len(overdue) / len(completed) * 100 if len(completed) > 0 else 0
    print(f"Non-compliance rate: {round(non_compliance_rate, 1)}%")

    return with_results, overdue

# ---------------------------------------------------------------
# 4. Run for GLP-1 receptor agonists (semaglutide, liraglutide, tirzepatide)
# ---------------------------------------------------------------

studies_raw = fetch_trials("semaglutide OR liraglutide OR tirzepatide", max_studies=500)
df = parse_studies(studies_raw)

print("\nPhase breakdown:")
print(df["phase"].value_counts().head(10).to_string())

print("\nStatus breakdown:")
print(df["status"].value_counts().head(10).to_string())

with_results, overdue = compute_lag(df)

print("\nTop sponsors by trial count:")
print(df["sponsor"].value_counts().head(5).to_string())

# Only print the sample when there is at least one overdue trial.
if len(overdue) > 0:
    print("\nSample overdue trials (NCT ID, sponsor, PCD):")
    cols = ["nct_id", "sponsor", "primary_completion_date"]
    print(overdue[cols].head(5).to_string(index=False))

The data as a public accountability tool

ClinicalTrials.gov is unusual among federal databases in that it was explicitly designed as an accountability mechanism, not just an informational resource. The mandatory registration requirement was enacted precisely because the voluntary system had failed: sponsors were registering trials selectively and reporting results even more selectively. The database exists to make selective reporting impossible in principle, even if weak enforcement has left it possible in practice.

For any analysis of clinical evidence—systematic reviews, regulatory submissions, treatment guidelines, health technology assessments—ClinicalTrials.gov is the starting point. The published literature is a biased sample of the registered trials. A treatment whose efficacy appears robust in a PubMed search may look substantially weaker when all registered trials, including those with unreported negative results, are factored in. The registry does not solve publication bias, but it creates the evidence needed to measure its magnitude.

