
The Federal Research Enterprise: Joining NSF, NIH, and Research-Misconduct Data

12 min read · AI Analytics
Research Funding · NSF · NIH · Research Integrity · Data Engineering

Federal science money flows through a small number of channels, and almost all of it is documented. The National Science Foundation records the awards that fund the physical, social, and engineering sciences; the National Institutes of Health record the grants that dominate biomedical research; and the Office of Research Integrity records the findings—fabrication, falsification, plagiarism—that mark where the integrity system failed. Each names the grantee institution and the principal investigator. Join them on those two keys and the flow of federal research dollars and its accountability layer come together in one view: who gets funded, for what, and what share of the funded work later draws a misconduct finding.

This article covers eight things. First, what the three federal research datasets are and how they fit together. Second, the two funding pillars (NSF for the non-medical sciences and engineering, NIH for biomedicine) and how their dollar scales differ. Third, the Office of Research Integrity as the oversight layer that sits against the grant record, and the limits of what it covers. Fourth, the two join keys, the grantee institution and the principal investigator, and the name-normalization problem that is the central engineering obstacle to using them. Fifth, the shape of our three tables and the columns that carry the keys. Sixth, the science-policy questions the assembled data answers: funding concentration, shifts across fields and administrations, the misconduct rate against the grant record, and the consequences of misconduct measured against the dollars at stake. Seventh, a worked Python workflow that aggregates awards by institution, builds an investigator's funding history, and joins ORI findings to it. Eighth, the caveats every analyst must internalize: entity resolution, the PHS-only scope of ORI, and the difference between correlation and causation.

What the three datasets are

The federal research enterprise is large, but its public data footprint is concentrated in a few authoritative sources. This guide assembles three of them. The first is the NSF awards record: every grant the National Science Foundation makes, with the award amount, the title and abstract, the awardee institution, the principal investigator, the NSF program and directorate, and the start and end dates. The second is the NIH grants record, published through NIH RePORTER and its bulk ExPORTER files: every research project the National Institutes of Health fund, with the project number, the funding amount, the organization, the principal investigators, the administering institute or center, and the fiscal year. The third is the ORI misconduct record: the case summaries the HHS Office of Research Integrity publishes when it finds research misconduct in Public Health Service–funded work, naming the respondent, the institution, the nature of the misconduct, and the administrative actions imposed.

What makes the three worth joining rather than reading separately is that they document different moments of the same lifecycle. The award and grant records are the input side: where the money goes, by institution, investigator, field, and program. The misconduct record is the accountability side: where the integrity system intervened. Read alone, the funding data is a map of patronage with no quality signal; read alone, the misconduct data is a list of bad actors with no denominator. Read together—keyed on the institution and the investigator that both sides name—they become something neither is alone: a view of the federal research enterprise that pairs the flow of money with the record of its oversight, so that a funded investigator's grant history can be lined up against any finding against them, and the misconduct record can be given the denominator—the universe of funded work—it otherwise lacks.

The two funding pillars: NSF and NIH

The two agencies that anchor this picture are the National Science Foundation and the National Institutes of Health, and the division of labor between them is clean. The National Science Foundation, an independent federal agency created in 1950, funds a large share of the federally supported basic research conducted at US colleges and universities across the non-medical sciences and engineering—mathematics, physics, chemistry, computer science, the geosciences, biology outside the biomedical core, the social and behavioral sciences, and the engineering disciplines. Its grants are made competitively through a peer-review process organized by directorate and program, and its statutory mission is to promote the progress of science broadly rather than to pursue a single applied goal. The NSF awards record is therefore the closest thing the country has to a comprehensive ledger of who gets funded to do basic science outside of medicine.

The National Institutes of Health, part of the Department of Health and Human Services, is the larger of the two by a wide margin—its budget runs to tens of billions of dollars a year, and it is the dominant funder of biomedical research in the United States and, by most measures, the world. NIH is not a single grant-making body but a federation of institutes and centers, each focused on a disease area or a stage of the research enterprise, that make grants to universities, hospitals, research institutes, and small businesses. Its grants flow through a vast apparatus of mechanisms—the investigator-initiated research project grant being the archetype—and the RePORTER system makes the resulting portfolio searchable down to the individual project. Because NIH dollars dwarf NSF dollars, any combined view of federal research funding is numerically dominated by biomedicine; NSF supplies the breadth across fields, NIH supplies the bulk of the money.

Taken together, the two records show where federal science money actually goes—by institution, by investigator, by field, and by program. That is a more powerful statement than it sounds. The award and grant records are not summaries or appropriations totals; they are the line items, one row per award, each tied to a named recipient and a named researcher. From them an analyst can reconstruct the entire distribution of federal research support: how much each university receives, how it splits between agencies and fields, which investigators are the largest recipients, and how all of that has shifted over time. The two agencies are not the whole of federal research funding—the Department of Energy, NASA, the Department of Defense, and the Department of Agriculture all fund substantial research too—but NSF and NIH are the two pillars, and between them they account for the largest share of the federally funded basic and biomedical research conducted at American institutions.

The accountability layer: ORI and research misconduct

Against the grant record sits the oversight record. The Office of Research Integrity (ORI), within the Department of Health and Human Services, oversees research-integrity activities for Public Health Service–funded research, which in practice mostly means NIH-funded research. ORI's remit is research misconduct, which federal regulation defines narrowly as fabrication, falsification, or plagiarism in proposing, performing, or reviewing research, or in reporting research results: the canonical “FFP” triad. Crucially, research misconduct does not include honest error or honest differences of scientific opinion; it is reserved for the deliberate corruption of the research record, and a finding requires that the misconduct be committed intentionally, knowingly, or recklessly and be proven by a preponderance of the evidence.

The institutional architecture matters for the data. Most misconduct investigations are conducted in the first instance by the institution where the research was done (the university or research center has the primary responsibility to inquire, investigate, and adjudicate), with ORI overseeing the process, reviewing the institution's findings, and making its own determinations. When ORI concludes that misconduct occurred, it can impose administrative actions: requiring supervision of the respondent's research, requiring correction or retraction of the affected publications, debarring the respondent from receiving federal funding for a stated period, and barring them from serving on PHS advisory committees. ORI then publishes its findings as case summaries and in the Federal Register, naming the respondent, the institution, the specific findings of fabrication, falsification, or plagiarism, and the administrative actions imposed. That published record is the misconduct dataset: the accountability layer that, because it names the institution and the investigator, can be lined up directly against the grants involved.

The join keys: institution and investigator

The reason the three datasets can be assembled at all is that they share two identifying dimensions. Both the funding records and the misconduct findings name the grantee institution and the principal investigator. Those are the join keys, and they support the three core operations this guide is built around. The institution key lets an analyst aggregate funding by university: summing NSF awards and NIH grants for a given institution to measure its total federal research support and how that support splits across agencies and fields. The investigator key lets an analyst trace an investigator's grant history: collecting every award and grant on which a researcher is named as principal investigator, across agencies and over time, into a single funding biography. And using both keys together lets an analyst line ORI findings up against the grants involved: matching a misconduct respondent to the funded work in their name, so that a finding is no longer an isolated case summary but a marker placed on a specific stretch of a funding record.

The obstacle—and it is the central engineering challenge of this entire exercise—is that the keys are names, not stable identifiers. Institution names and investigator names are recorded as free text, and they vary relentlessly across datasets and within them. “University of California, Berkeley” appears as “UC Berkeley,” “Univ of California-Berkeley,” and “Regents of the University of California”; a single medical school may be recorded under the university, the health system, or the hospital. Investigator names carry the full burden of homonymy and formatting: initials versus full first names, maiden and married names, transliteration variants, suffixes, and the simple fact that many people share a name. The funding agencies have moved toward persistent researcher identifiers, and NIH and NSF both attach institutional identifiers, but those identifiers are not consistently present across the historical record and are largely absent from the ORI case summaries—which means that joining the misconduct findings to the grant record falls back, in practice, on name matching. Investigator-name and institution-name normalization is therefore the usual obstacle, and the bulk of the real work in any honest version of this analysis is entity resolution, not querying.
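A minimal sketch of that name-matching fallback for institutions, assuming a small hand-built alias table. `ALIASES` and `canonical_institution` are hypothetical names, the alias entries are illustrative, and the 0.85 cutoff is an arbitrary starting point rather than a tuned value. Exact alias hits are accepted outright; near misses come back with a similarity score so they can be sent to manual review.

```python
from difflib import SequenceMatcher

# Hypothetical alias table: variant spelling -> canonical institution name.
ALIASES = {
    "UC BERKELEY": "UNIVERSITY OF CALIFORNIA, BERKELEY",
    "UNIV OF CALIFORNIA-BERKELEY": "UNIVERSITY OF CALIFORNIA, BERKELEY",
    "UNIVERSITY OF CALIFORNIA, BERKELEY": "UNIVERSITY OF CALIFORNIA, BERKELEY",
}


def canonical_institution(raw, cutoff=0.85):
    """Return (canonical_name, score); canonical_name is None below cutoff."""
    key = " ".join((raw or "").upper().split())   # uppercase, collapse spaces
    if key in ALIASES:
        return ALIASES[key], 1.0
    # No exact alias: score against every known variant and keep the best.
    best_name, best_score = None, 0.0
    for variant, canon in ALIASES.items():
        score = SequenceMatcher(None, key, variant).ratio()
        if score > best_score:
            best_name, best_score = canon, score
    if best_score >= cutoff:
        return best_name, best_score
    return None, best_score


print(canonical_institution("uc  berkeley"))
print(canonical_institution("Univ of California - Berkeley"))
print(canonical_institution("Stanford University"))
```

A string-similarity score cannot tell a campus from its parent system, so a variant like “Regents of the University of California” still needs an explicit alias entry or a human decision.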

What the assembled data looks like

In our database the three records are stored as three tables (nsf_awards, nih_grants, and ori_misconduct), each at its natural grain (one award, one project, one finding per row) and each carrying the institution and investigator identifiers that make the join possible. The columns that matter for the join are the funding amounts, the field and program, the dates, and, above all, the institution and principal-investigator name fields shared across all three:

-- nsf_awards (one row per NSF award)
award_id              -- NSF award identifier
title                 -- project title
awardee_name          -- grantee institution (JOIN KEY)
pi_name               -- principal investigator (JOIN KEY)
funds_obligated_amt   -- dollars obligated
directorate / program -- NSF organizational unit and program
start_date / end_date -- award period

-- nih_grants (one row per NIH project)
project_num           -- NIH project number (activity + serial + year)
project_title         -- project title
org_name              -- grantee organization (JOIN KEY)
pi_names              -- principal investigator(s) (JOIN KEY)
award_amount          -- total cost awarded
ic_name               -- administering institute / center
fiscal_year           -- federal fiscal year

-- ori_misconduct (one row per finding)
respondent            -- investigator found responsible (JOIN KEY)
institution           -- institution where work was done (JOIN KEY)
finding               -- fabrication / falsification / plagiarism
administrative_action -- supervision, retraction, funding bar, term
finding_date          -- date of the ORI finding

The shared columns are the point. In nsf_awards the keys are awardee_name and pi_name; in nih_grants they are org_name and pi_names; in ori_misconduct they are institution and respondent. The columns are named differently because they come from three different agencies, but they denote the same two real-world things: the place where the research was done and the person who led it. Everything else in the tables—the dollar amounts, the program and institute, the dates, the nature of the finding—is the payload that becomes meaningful once the rows are aligned on those keys. The work, as the previous section stressed, is normalizing those names and aligning the grant and finding records; the schema makes the destination clear, but it does not make the entity resolution free.
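As a rough illustration of that alignment, the sketch below renames each table's key columns to one shared pair and left-joins the findings onto the funding rows. The rows are invented toy data with names already normalized, and the semicolon separator inside pi_names is an assumption about how multiple investigators are stored, not a documented format.

```python
import pandas as pd

# Toy rows standing in for the three tables; every value is illustrative.
nsf_awards = pd.DataFrame({
    "awardee_name": ["EXAMPLE STATE"], "pi_name": ["ADA EXAMPLE"],
    "funds_obligated_amt": [250_000]})
nih_grants = pd.DataFrame({
    "org_name": ["EXAMPLE STATE", "EXAMPLE STATE"],
    "pi_names": ["ADA EXAMPLE; BOB SAMPLE", "BOB SAMPLE"],
    "award_amount": [400_000, 150_000]})
ori_misconduct = pd.DataFrame({
    "respondent": ["BOB SAMPLE"], "institution": ["EXAMPLE STATE"],
    "finding": ["falsification"]})

# Rename each agency's key columns to one shared pair.
nsf = nsf_awards.rename(columns={"awardee_name": "institution",
                                 "pi_name": "investigator",
                                 "funds_obligated_amt": "amount"})
nsf["agency"] = "NSF"
nih = nih_grants.rename(columns={"org_name": "institution",
                                 "award_amount": "amount"})
# pi_names can hold several people; give each one their own row.
# Note: this repeats the full award amount on every exploded row.
nih["investigator"] = nih["pi_names"].str.split(";")
nih = nih.explode("investigator").drop(columns="pi_names")
nih["investigator"] = nih["investigator"].str.strip()
nih["agency"] = "NIH"

funding = pd.concat([nsf, nih], ignore_index=True)
findings = ori_misconduct.rename(columns={"respondent": "investigator"})

# Left join keeps every funded row; 'finding' is NaN where nothing matched.
joined = funding.merge(findings, on=["institution", "investigator"], how="left")
print(joined)
```

Because a multi-investigator project appears once per person after the explode, dollar totals should be summed from the unexploded table to avoid double counting.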

The questions the data answers

Assembled, the three datasets answer the questions that sit behind science policy: questions about how federal research support is distributed, how it changes, and how its oversight performs. The first is funding concentration. How concentrated is federal research funding among elite institutions? Aggregating NSF and NIH dollars by university produces the empirical distribution directly, and it is famously skewed: a relatively small set of research-intensive universities and academic medical centers absorbs a large fraction of the total, and the combined NSF-plus-NIH view shows whether an institution's prominence rests on biomedical strength, breadth across the other sciences, or both. That concentration is a recurring subject of science-policy debate (about geographic equity, about the advantage of incumbency in peer review, about whether the system over-rewards the already strong), and the joined data quantifies it rather than merely asserting it.
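One simple way to put a number on that skew is the share of all dollars held by the top k recipients. The sketch below uses invented amounts and single-letter institution names; `top_share` is a hypothetical helper, and the agency pivot shows the NSF-versus-NIH mix for each institution.

```python
import pandas as pd


def top_share(totals, k):
    """Fraction of all dollars held by the k largest recipients."""
    ranked = totals.sort_values(ascending=False)
    return ranked.head(k).sum() / ranked.sum()


# Toy award rows; amounts are illustrative, not real funding figures.
awards = pd.DataFrame({
    "institution": ["A", "A", "B", "C", "D", "E"],
    "agency": ["NSF", "NIH", "NIH", "NSF", "NIH", "NSF"],
    "amount": [100, 500, 200, 100, 50, 50]})

by_inst = awards.groupby("institution")["amount"].sum()
print(f"top-1 share: {top_share(by_inst, 1):.0%}")
print(f"top-2 share: {top_share(by_inst, 2):.0%}")

# Agency mix per institution: biomedical strength, breadth, or both.
mix = awards.pivot_table(index="institution", columns="agency",
                         values="amount", aggfunc="sum", fill_value=0)
print(mix)
```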

The second question is how funding shifts across fields and administrations. Because every award and grant carries a field or program and a date, the data supports time series: the rise of computing and the life sciences relative to the physical sciences, the budget swings that follow appropriations cycles and changes in administration, the targeted surges that accompany national priorities (a pandemic, a push on a particular disease, a strategic-technology initiative). Reading the NSF program structure alongside the NIH institute structure lets an analyst watch the federal research portfolio rebalance across decades—which is to say, watch national priorities expressed in dollars. The grant record is, among other things, a longitudinal record of what the country has decided is worth knowing.
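A small sketch of such a time series, assuming a `field` column has already been derived by some crosswalk from NSF directorates and NIH institutes (that crosswalk is not shown and is itself a modeling choice). The rows are invented, and the amounts are nominal dollars with no inflation adjustment.

```python
import pandas as pd

# Toy grant rows; 'field' is a hypothetical harmonized label.
grants = pd.DataFrame({
    "fiscal_year": [2000, 2000, 2010, 2010],
    "field": ["physical", "computing", "physical", "computing"],
    "amount": [300, 100, 300, 300]})

# Dollars per field per year, then each field's share of that year's total.
by_year_field = grants.pivot_table(index="fiscal_year", columns="field",
                                   values="amount", aggfunc="sum",
                                   fill_value=0)
shares = by_year_field.div(by_year_field.sum(axis=1), axis=0)
print(shares.round(2))
```

Shares rather than raw totals make the rebalancing visible even when the overall budget grows.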

The third and fourth questions are the ones the misconduct join uniquely enables. What share of funded work later draws misconduct findings? Lining the ORI respondents up against the funded investigator population gives the misconduct record a denominator, letting an analyst express findings not as a raw count but as a rate against the universe of funded researchers, and ask how that rate varies by field, institution, and career stage. And how do the consequences of misconduct compare to the dollars at stake? Setting the administrative actions ORI imposes (the length of a funding bar, the requirement to retract) against the funding history of the respondent measures whether the accountability is proportionate to the scale of the federal investment that the misconduct touched. These are not idle questions: they go to whether the integrity system is calibrated to the money it is meant to protect.
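Both calculations can be sketched on toy data, assuming the name matching has already been done and confirmed. The investigator labels, the `funding_bar_years` column, and the dollars-per-bar-year ratio are all hypothetical illustrations rather than established metrics.

```python
import pandas as pd

# Funded investigators (the denominator) with lifetime totals; toy values.
funded = pd.DataFrame({
    "investigator": ["P1", "P2", "P3", "P4"],
    "field": ["bio", "bio", "bio", "chem"],
    "total_funding": [1_000_000, 500_000, 250_000, 750_000]})
# Confirmed findings (the numerator) with the length of the funding bar.
findings = pd.DataFrame({
    "investigator": ["P2"],
    "funding_bar_years": [3]})

merged = funded.merge(findings, on="investigator", how="left")
merged["has_finding"] = merged["funding_bar_years"].notna()

# Question three: findings as a rate against funded researchers, by field.
rate_by_field = merged.groupby("field")["has_finding"].mean()
print(rate_by_field)

# Question four: dollars at stake per year of funding bar imposed.
hit = merged[merged["has_finding"]].copy()
hit["dollars_per_bar_year"] = hit["total_funding"] / hit["funding_bar_years"]
print(hit[["investigator", "total_funding", "dollars_per_bar_year"]])
```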

Where the data lives and how it is accessed

All three sources are public and key-free, which is what makes the assembly tractable for an outside analyst. NSF exposes its awards through a public awards API that accepts queries by awardee, investigator, program, and date and returns structured award records, and it publishes bulk award downloads for those who want the whole corpus rather than targeted queries. NIH publishes its grants through RePORTER—both an interactive web interface and a REST API that accepts a JSON criteria object specifying organizations, investigators, fiscal years, and institutes—and through the ExPORTER bulk files, annual flat-file releases of the full project, abstract, and publication-linkage data that are the right tool for portfolio-scale work. ORI publishes its findings as case-summary listings on its website and in the Federal Register; these are less structured than the funding APIs—they are human-readable summaries rather than a queryable database—so in practice an analyst extracts the respondent, institution, finding, and action fields from the listings into a structured table before joining.
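To make the RePORTER criteria object concrete, here is a sketch that only builds the request body and sends nothing. `reporter_payload` is a hypothetical helper, and the criteria keys used here (`org_names`, `pi_names`, `fiscal_years`, `agencies`) are assumptions that should be checked against the current RePORTER API documentation before use.

```python
import json


def reporter_payload(org=None, pi_last_name=None, fiscal_years=None,
                     institutes=None, offset=0, limit=100):
    """Build a JSON-serializable search body; only set filters are included."""
    criteria = {}
    if org:
        criteria["org_names"] = [org]
    if pi_last_name:
        # Assumed shape: a list of name objects, one per investigator.
        criteria["pi_names"] = [{"last_name": pi_last_name}]
    if fiscal_years:
        criteria["fiscal_years"] = list(fiscal_years)
    if institutes:
        # Assumed key for the administering institute or center codes.
        criteria["agencies"] = list(institutes)
    return {"criteria": criteria, "offset": offset, "limit": limit}


payload = reporter_payload(org="STANFORD UNIVERSITY", fiscal_years=[2022, 2023])
print(json.dumps(payload, indent=2))
```

Building the body separately from the HTTP call makes it easy to log or inspect exactly which filters a given pull used.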

The asymmetry between the sources is itself instructive. The two funding records are mature, structured, machine-readable datasets backed by APIs and bulk files; the misconduct record is a curated, narrative, comparatively small set of published findings. That mismatch is not an accident of engineering—it reflects the underlying reality that funding is a high-volume administrative process while a misconduct finding is a rare, deliberate, heavily adjudicated event. The practical consequence for the join is that the funding side supplies the clean keys and the large denominator, while the misconduct side supplies a small, hand-curated set of names that must be matched carefully and conservatively against it.

Python workflow: aggregate funding, build a history, join the findings

The script below performs the three core operations end to end. It pulls NSF awards from the NSF awards API and NIH projects from the RePORTER v2 API for a single institution and sums the federal research dollars on each side; it groups the NSF awards by normalized investigator name to build a per-investigator funding history; and it joins a structured table of ORI case summaries—extracted from the published listings—to the funded investigator population by normalized name, surfacing any respondent who matches a funded principal investigator at the institution. No API keys are required. Requirements: requests and pandas. The norm helper is deliberately crude and flagged as such—it stands in for the real entity-resolution layer that any serious version of this analysis must build.

import requests, time
import pandas as pd

# Three federal research sources, all public and key-free:
#   1. NSF Awards API   -- physical, social, engineering sciences
#   2. NIH RePORTER API -- biomedical research grants
#   3. ORI case summaries -- misconduct findings (PHS / largely NIH funded)
# The join keys are the grantee INSTITUTION and the principal INVESTIGATOR.
NSF = "https://api.nsf.gov/services/v1/awards.json"
NIH = "https://api.reporter.nih.gov/v2/projects/search"


def nsf_awards(institution, max_rows=500):
    # NSF awards API: free-text institution match, paged 25 at a time.
    out, offset = [], 1
    fields = "id,title,awardeeName,piFirstName,piLastName,fundsObligatedAmt,date"
    while offset <= max_rows:
        params = {"awardeeName": institution, "printFields": fields,
                  "offset": offset, "rpp": 25}
        r = requests.get(NSF, params=params, timeout=60)
        r.raise_for_status()
        rows = r.json().get("response", {}).get("award", [])
        if not rows:
            break
        out.extend(rows)
        offset += 25
        time.sleep(0.3)
    return pd.DataFrame(out)


def nih_grants(institution, max_rows=500):
    # NIH RePORTER v2: POST a JSON criteria object; org name is a filter.
    out, offset = [], 0
    while offset < max_rows:
        body = {"criteria": {"org_names": [institution]},
                "include_fields": ["ProjectNum", "ProjectTitle",
                                   "Organization", "PrincipalInvestigators",
                                   "AwardAmount", "FiscalYear"],
                "offset": offset, "limit": 100}
        r = requests.post(NIH, json=body, timeout=60)
        r.raise_for_status()
        rows = r.json().get("results", [])
        if not rows:
            break
        out.extend(rows)
        offset += 100
        time.sleep(0.3)
    return pd.json_normalize(out)


def norm(name):
    # Institution / investigator names need normalization before they join.
    # This is a FIRST PASS only -- real entity resolution is harder.
    if not isinstance(name, str):       # NaN / None from a CSV or API row
        return ""
    drop = {"UNIVERSITY", "UNIV", "THE"}
    # Uppercase, split on whitespace, drop filler tokens, rejoin with
    # single spaces.
    return " ".join(t for t in name.upper().split() if t not in drop)


# --- 1. Aggregate NSF + NIH funding for one institution -----------------
inst = "Stanford University"
nsf = nsf_awards(inst)
nih = nih_grants(inst)
nsf_total = pd.to_numeric(nsf.get("fundsObligatedAmt"), errors="coerce").sum()
nih_total = pd.to_numeric(nih.get("award_amount"), errors="coerce").sum()
print(f"{inst}: NSF ${nsf_total:,.0f} across {len(nsf):,} awards; "
      f"NIH ${nih_total:,.0f} across {len(nih):,} projects")

# --- 2. Build one investigator's funding history ------------------------
# Index the columns directly: a missing column should fail loudly here
# rather than silently produce empty names.
nsf["pi"] = (nsf["piFirstName"].fillna("") + " " +
             nsf["piLastName"].fillna("")).map(norm)
by_pi = nsf.groupby("pi")["fundsObligatedAmt"].apply(
    lambda s: pd.to_numeric(s, errors="coerce").sum())
print("\nTop NSF-funded investigators at this institution:")
for pi, amt in by_pi.sort_values(ascending=False).head(10).items():
    print(f"  {pi[:34]:<34} ${amt:>14,.0f}")

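# --- 3. Join ORI misconduct findings against the grant record -----------
# ORI publishes case summaries (respondent name + institution). Load a
# local CSV scraped from the ORI case-summary listings, then match on the
# normalized investigator name. ORI covers PHS-funded work, so the NIH PIs
# are the relevant population; a match against an NSF-only PI is at most a
# lead to check and should not enter any misconduct rate.
ori = pd.read_csv("ori_case_summaries.csv")   # columns: respondent, institution
ori["_pi"] = ori["respondent"].map(norm)
nih_pis = set()
if "principal_investigators" in nih.columns:  # assumed response field name
    for pis in nih["principal_investigators"].dropna():
        for p in pis:
            nih_pis.add(norm(f"{p.get('first_name') or ''} "
                             f"{p.get('last_name') or ''}"))
funded_pis = (set(by_pi.index) | nih_pis) - {""}
flagged = ori[ori["_pi"].isin(funded_pis)]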
print(f"\nORI respondents matching a funded PI at {inst}: {len(flagged)}")
for _, row in flagged.iterrows():
    print(f"  {row['respondent']}  ({row['institution']})")

Two things about the script deserve emphasis. First, the funding aggregation is the easy part: the NSF and NIH APIs return clean, structured rows, and summing dollars by institution or grouping awards by investigator is routine pandas. The hard part is the third step—the ORI join—and the script makes its fragility visible by routing the match through the same norm helper used for the investigators. A real implementation would replace that exact-normalized-string match with a proper entity-resolution pipeline: fuzzy matching with manual review, disambiguation by institution and time window, and confirmation against any persistent identifiers present, because a bare name match between an ORI respondent and a funded PI is a candidate to investigate, not a confirmed link. Second, for anything beyond a single institution—ranking universities nationally, building the full funded-investigator denominator—the NIH ExPORTER bulk files and the NSF bulk award downloads are far more efficient than paging the APIs thousands of times, and they ship the authoritative field definitions for each release.
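A sketch of what the first stage of such a pipeline might look like, using only the standard library. `FundedPI` and `candidates` are hypothetical names, the 0.9 similarity floor and the five-year slack are arbitrary assumptions, and the output is a list of candidates for manual review, not a set of confirmed links.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class FundedPI:
    name: str
    institution: str
    first_year: int   # first year of any award in this PI's history
    last_year: int    # last year of any award in this PI's history


def candidates(respondent, institution, finding_year, funded,
               min_name=0.9, slack=5):
    """Return (pi, score) pairs worth sending to manual review, best first."""
    out = []
    for pi in funded:
        score = SequenceMatcher(None, respondent.upper(),
                                pi.name.upper()).ratio()
        same_inst = institution.upper() == pi.institution.upper()
        # Findings lag the research, so allow some years past the last award.
        in_window = pi.first_year <= finding_year <= pi.last_year + slack
        if score >= min_name and same_inst and in_window:
            out.append((pi, round(score, 3)))
    return sorted(out, key=lambda t: t[1], reverse=True)


funded = [
    FundedPI("Jane Q Doe", "Example State", 2005, 2012),
    FundedPI("Jane Q Doe", "Other College", 2005, 2012),   # homonym elsewhere
    FundedPI("Jane Q Doe", "Example State", 1980, 1985),   # same name, wrong era
]
hits = candidates("Jane Q. Doe", "Example State", 2014, funded)
for pi, score in hits:
    print(pi, score)
```

The institution and time-window filters are what keep the two homonyms out; the name score alone would have accepted all three records.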

Limitations and analytical caveats

The assembled view is powerful, but it rests on a join across three independently maintained datasets, and several limitations must be held firmly in mind before drawing conclusions.

Entity resolution is the dominant source of error. Because the join runs on names rather than stable identifiers, every conclusion is only as good as the name normalization behind it. Under-matching (failing to recognize that two spellings denote the same institution or person) fragments an investigator's funding history and misses true links between findings and grants. Over-matching (collapsing two distinct people who share a name) attributes funding or a misconduct finding to the wrong person, which in the misconduct context is a serious error with real reputational stakes. There is no shortcut here: any analyst who treats a normalized-string match as ground truth, rather than as a candidate requiring confirmation, will produce a result that is confidently wrong. The crude norm helper in the script is a placeholder for a problem that, done properly, is most of the work.

ORI covers only PHS-funded work. This is the single most important scoping caveat for the misconduct join. The Office of Research Integrity's jurisdiction extends to research funded by the Public Health Service—largely NIH—and not to research funded by NSF, the Department of Energy, NASA, or other agencies, each of which handles misconduct under its own authority (NSF's own Office of Inspector General, for example, oversees misconduct in NSF-funded work). The practical consequence is an asymmetry that an analyst must not forget: the ORI findings can be lined up against the NIH grant record on a roughly common footing, but they do not represent the misconduct universe for NSF-funded research. A combined NSF-plus-NIH funding denominator joined to an ORI-only misconduct numerator will understate misconduct on the NSF side to zero by construction—not because NSF-funded research is free of misconduct, but because ORI does not adjudicate it. Any rate computed across the combined funding population is therefore a rate against the wrong denominator unless it is restricted to PHS-funded work.
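The effect of the wrong denominator is easy to see on toy data. In the sketch below the investigator labels and the single ORI respondent are invented; the point is only that the same numerator gives a different rate once the population is restricted to NIH-funded investigators.

```python
import pandas as pd

# Toy funded population: two NIH-funded and two NSF-funded investigators.
funded = pd.DataFrame({
    "investigator": ["P1", "P2", "P3", "P4"],
    "agency": ["NIH", "NIH", "NSF", "NSF"]})
ori_respondents = {"P1"}   # ORI can only ever name PHS-funded respondents

funded["ori_finding"] = funded["investigator"].isin(ori_respondents)

# Naive rate: ORI numerator over the combined NSF + NIH denominator.
naive_rate = funded["ori_finding"].mean()
# Scoped rate: the same numerator over PHS (here NIH) funded work only.
phs_rate = funded.loc[funded["agency"] == "NIH", "ori_finding"].mean()

print(f"naive combined rate: {naive_rate:.2f}")
print(f"PHS-restricted rate: {phs_rate:.2f}")
```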

A finding is a rare, lagging, and partial signal. Research misconduct findings are few relative to the volume of funded work, they arrive years after the underlying research because investigation and adjudication are slow, and they capture only the misconduct that was detected, investigated, and proven to the regulatory standard. The published record is thus a floor, not a census: it understates the true incidence of misconduct, and it does so unevenly, because detection depends on whistleblowers, on institutional diligence, and on the visibility of the work. Treating the count of ORI findings as a measure of how much misconduct occurs, rather than how much was caught and proven, misreads the data. The funding-to-misconduct rate this analysis can compute is a rate of adjudicated findings against funded work, which is a meaningful and policy-relevant quantity, but it is not the underlying misconduct rate.

Funding totals are not impact, and a join is not a causal claim. Aggregating dollars by institution or investigator measures inputs, not outputs: a large funding total reflects the scale of the federal investment, not the quality, productivity, or societal value of the research it bought. And lining a misconduct finding up against a funding history establishes association, not cause—the funded work and the misconduct may overlap in name and institution without the finding pertaining to the specific grants summed. The disciplined reading keeps these layers separate: the funding data describes where the money went, the misconduct data describes where the integrity system intervened, and the join lets the two be examined side by side—a powerful frame for asking questions, and a treacherous one for asserting answers.

Held with those caveats, the three-table assembly (nsf_awards, nih_grants, and ori_misconduct, joined on the grantee institution and the principal investigator) is a compact map of the federal research enterprise: the two pillars that fund American science set against the accountability layer that polices its integrity, so that the flow of federal research money and the record of its oversight can be read in a single view.

Related writing

NIH Research Portfolio: The Federal Database Behind $50 Billion in Annual Biomedical Grants — The deep dive on the larger of the two funding pillars, the RePORTER and ExPORTER data behind NIH's biomedical grants that supplies the bulk of the dollars and the PHS-funded denominator this join rests on.

NSF Awards: The Federal Record of Who Gets US Science Funding — The companion deep dive on the NSF awards record that supplies the breadth across the non-medical sciences and engineering, and the awardee and investigator fields that form one half of every join in this guide.

ORI Research Misconduct Database: The Federal Record Behind Scientific Fraud and Fabrication — The accountability layer in detail: how the Office of Research Integrity defines fabrication, falsification, and plagiarism, adjudicates PHS-funded cases, and publishes the findings this analysis lines up against the grant record.