CMS Healthcare-Associated Infections: The Federal Record of CLABSI, CAUTI, MRSA, and C. diff in US Hospitals

12 min read · AI Analytics
CMS · Hospital Infections · HAI · Patient Safety · Federal Data

Some of the infections a patient carries out of a hospital were given to them by the hospital—through a central line into a vein, a catheter into the bladder, a surgical incision, or a ward where a resistant organism is circulating. The federal government counts those infections. CMS publishes, for every hospital in the country, how many bloodstream infections, urinary-tract infections, surgical-site infections, MRSA cases, and C. diff cases its patients acquired while in its care, measured against how many a hospital of that size and case mix would be expected to have. The result is roughly 173,000 hospital-by-measure records—each one a single hospital's Standardized Infection Ratio for a single infection, the public scorecard of how safe it is to be a patient there.

This article covers what the healthcare-associated infections dataset is and where it comes from; the data chain from the bedside through the CDC's National Healthcare Safety Network to CMS Care Compare; the six standard measures—CLABSI, CAUTI, the two surgical-site infections, MRSA bacteremia, and C. diff; the Standardized Infection Ratio and the observed-over-predicted arithmetic that makes hospitals of different size comparable; the real money riding on the numbers through the Hospital-Acquired Condition Reduction Program and Hospital Value-Based Purchasing; how the table joins to the rest of the hospital record through the CMS Certification Number; a Python workflow that pulls the HAI dataset from data.cms.gov, ranks hospitals by SIR, and computes the share of hospitals worse than the national benchmark by state; and the caveats—risk adjustment, small denominators, suppression, and reporting lag—that every analyst must internalize before treating an SIR as a verdict.

What the dataset is

A healthcare-associated infection (HAI) is an infection a patient acquires while receiving care that was not present or incubating when they arrived—an infection produced, in effect, by the encounter with the healthcare system itself. HAIs are among the most consequential and most preventable harms in medicine: they prolong stays, drive up cost, and kill patients who were admitted for something else entirely. Because they are largely preventable through disciplined practice—hand hygiene, sterile technique, prompt removal of unnecessary lines and catheters, antibiotic stewardship—they are a natural target for public measurement, and the federal government has built one of the most complete public records of patient-safety performance that exists anywhere.

CMS publishes hospital-level HAI measures on Care Compare, the consumer-facing successor to Hospital Compare, as part of the Hospital Inpatient Quality Reporting (IQR) program. CMS does not collect the data directly; it comes from the CDC's National Healthcare Safety Network (NHSN), the surveillance system through which hospitals report infections using standardized case definitions. The grain of our table, cms_hospital_infections, is one row per hospital per measure per reporting period: a single hospital reporting six measures contributes six SIR rows, plus the component rows that carry the counts behind each ratio, and across several thousand hospitals the file runs to roughly 173,000 hospital-by-measure records. Each row records the headline ratio, the raw counts behind it, the denominator of exposure, a category showing how the hospital compares to the national benchmark, and a footnote code explaining any suppressed or unavailable value.

facility_id              -- CMS Certification Number (CCN), the hospital key
facility_name           -- hospital name
state                   -- two-letter state code
measure_id              -- e.g. HAI_1_SIR (CLABSI), HAI_2_SIR (CAUTI)...
measure_name            -- human-readable measure label
score                   -- the value, usually the Standardized Infection Ratio
compared_to_national    -- better / no different / worse than national benchmark
footnote                -- suppression / too-few-cases / not-available codes
-- the underlying NHSN numerator and denominator, by measure:
observed_infections     -- infections actually counted (numerator)
predicted_infections    -- infections the national baseline predicts
number_of_device_days   -- central-line- or catheter-days (the exposure base)
number_of_procedures    -- procedures performed (for surgical-site measures)
start_date / end_date   -- the reporting period the row covers

The facility_id is the load-bearing column. The CMS Certification Number (CCN) is the persistent six-character identifier CMS assigns to every certified hospital, and it is the key that ties an HAI row to the same hospital's readmissions, mortality, complications, star rating, cost reports, and ownership. The measure_id distinguishes which infection a row describes and, importantly, which kind of value the score holds, because the same dataset carries the Standardized Infection Ratio rows alongside the raw observed-count, predicted-count, and device-day rows that feed it. The compared_to_national column is CMS's own three-way summary (better than, no different from, or worse than the national benchmark), which is a useful cross-check on any ratio an analyst computes from the underlying counts. The remaining columns are the substantive payload: the observed and predicted infection counts and the device-day or procedure denominator are what make the SIR reproducible and what let an analyst go behind the published ratio to the arithmetic that produced it.
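Because one measure_id column carries several kinds of value, a common first step is to split each code into the infection it refers to and the component it holds. The sketch below assumes the codes follow an HAI_<number>_<SUFFIX> pattern. Only HAI_1_SIR and HAI_2_SIR are named in the schema above, so the numbering for the other four infections and any suffix other than SIR are assumptions to confirm against the data dictionary for the release in hand.

```python
import re

# Infection labels keyed by the number in the measure code. 1 and 2 come from
# the schema above; 3-6 are assumptions to confirm against the data dictionary.
INFECTION = {
    "1": "CLABSI",
    "2": "CAUTI",
    "3": "SSI colon surgery",
    "4": "SSI abdominal hysterectomy",
    "5": "MRSA bacteremia",
    "6": "C. diff",
}

# Assumed code shape: HAI_<number>_<SUFFIX>, e.g. HAI_1_SIR.
MEASURE_RE = re.compile(r"^HAI_(\d+)_([A-Z_]+)$")


def split_measure_id(measure_id):
    """Return (infection, component) for a code such as 'HAI_1_SIR'."""
    m = MEASURE_RE.match(measure_id.strip().upper())
    if m is None:
        raise ValueError(f"unrecognized measure_id: {measure_id!r}")
    number, component = m.groups()
    # Unknown numbers fall back to the raw prefix instead of guessing a label.
    return INFECTION.get(number, f"HAI_{number}"), component


print(split_measure_id("HAI_1_SIR"))
print(split_measure_id("HAI_2_NUMERATOR"))  # NUMERATOR is an assumed suffix
```

Keeping the component separate from the infection makes it harder to average SIR rows together with count rows by accident.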

From the bedside to NHSN to Care Compare

Understanding the data requires understanding its provenance, because the number on Care Compare is the end of a long, deliberate chain that begins with a person at a bedside. The chain has three links, and each shapes what the final figure can and cannot tell you.

The first link is case definition and surveillance. HAIs are not counted by an automated meter; they are identified by trained infection preventionists applying the CDC's standardized surveillance definitions. Whether a positive blood culture in a patient with a central line counts as a central-line-associated bloodstream infection—rather than a bloodstream infection arising from some other source—turns on a precise, criteria-based definition designed to be applied consistently across thousands of hospitals. This standardization is the reason the data is comparable at all, but it also means the count reflects the surveillance definition rather than every infection a clinician might recognize, and that hospitals must devote real staff effort to applying the definitions correctly.

The second link is the National Healthcare Safety Network (NHSN), the CDC's secure, web-based surveillance system into which hospitals report their infection events and the denominators (the device-days and the procedures) that establish how much exposure produced them. NHSN is the largest healthcare-associated infection tracking system in the country, and it is where the risk-adjustment models live: the CDC periodically establishes a national baseline from aggregated NHSN data and rebaselines it as practice improves, which is why an SIR is always interpreted relative to a stated baseline year.

The third link is publication: CMS requires hospitals participating in the Hospital IQR program to report specified HAI measures to NHSN, draws those measures from the CDC, and posts them on Care Compare and through the data.cms.gov Provider Data Catalog. The practical consequence of this chain is that the data is a CDC surveillance product distributed through a CMS payment program: authoritative and standardized, but also bounded by the surveillance definitions and the reporting requirements at each link.

The six standard measures

The HAI dataset reports a defined set of measures, each targeting an infection that is both common and meaningfully preventable. Six are the workhorses, and each describes a distinct route by which the act of caring for a patient can infect them.

Central-line-associated bloodstream infections (CLABSI) are infections that enter the bloodstream through a central venous catheter, a line threaded into a large vein, typically in an intensive-care patient, to deliver fluids, medications, or nutrition. Because the line provides a direct conduit into the bloodstream, a CLABSI is among the most dangerous HAIs, and it is also among the most preventable through sterile insertion practice and prompt line removal. Catheter-associated urinary tract infections (CAUTI) arise from indwelling urinary catheters; they are extraordinarily common because urinary catheters are ubiquitous, and the single most effective prevention is simply removing the catheter as soon as it is no longer needed.

Surgical site infections (SSI) are infections of the incision or the operative space following surgery; CMS reports them for two tracer procedures—colon surgery and abdominal hysterectomy—chosen because they are performed widely enough to yield comparable volumes across hospitals and carry a meaningful infection risk. MRSA bloodstream infections track bacteremia caused by methicillin-resistant Staphylococcus aureus, an antibiotic-resistant organism whose presence reflects both the burden of resistant pathogens in a facility and the rigor of its infection-control practice. Clostridioides difficile (C. diff) infections are infections of the gut by an organism that flourishes when antibiotics disrupt the normal intestinal flora; the C. diff measure is therefore as much a marker of antibiotic stewardship and environmental cleaning as of any single procedure. Together the six span the principal mechanisms—lines, catheters, incisions, resistant organisms, and antibiotic-driven gut infection—through which hospital care produces infection, and each is reported as its own row, with its own SIR, for every reporting hospital.

The Standardized Infection Ratio

The headline metric for every measure is the Standardized Infection Ratio (SIR), and it is worth understanding precisely because so much rides on it and because it is so easy to misread. The SIR is a single number defined as observed infections divided by predicted infections: the count of infections a hospital actually had, over the count a national baseline predicts it should have had given its characteristics. An SIR of exactly 1.0 means the hospital had as many infections as predicted; an SIR above 1.0 means it had more infections than predicted, which is worse than the baseline; and an SIR below 1.0 means it had fewer, which is better than the baseline. An SIR of 0.5, for instance, means the hospital had half as many infections as the national baseline predicted for a hospital like it.
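The observed-over-predicted arithmetic can be written out directly. The following is a minimal sketch with made-up counts; the function names are illustrative, and the reading it prints describes the point estimate only, not a statistical comparison.

```python
def sir(observed, predicted):
    """Standardized Infection Ratio: observed infections / predicted infections."""
    if predicted <= 0:
        raise ValueError("predicted count must be positive")
    return observed / predicted


def read_sir(value):
    """Plain-language reading of the point estimate (no significance test)."""
    if value > 1.0:
        return "more infections than predicted"
    if value < 1.0:
        return "fewer infections than predicted"
    return "as many infections as predicted"


# Illustrative counts only: the same predicted count with three outcomes.
for observed, predicted in [(5, 10.0), (10, 10.0), (15, 10.0)]:
    value = sir(observed, predicted)
    print(f"observed={observed} predicted={predicted} "
          f"SIR={value:.2f}: {read_sir(value)}")
```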

The reason the SIR is constructed this way rather than as a simple infection rate is risk adjustment. A large academic medical center with a busy intensive-care unit and a high-acuity, immunocompromised patient population will, all else equal, have more central lines, more catheters, and more opportunities for infection than a small community hospital. Comparing their raw infection counts, or even their rates per patient, would penalize the complex hospital for treating complex patients. The predicted count in the SIR's denominator is computed from the CDC's national baseline model, which incorporates the factors known to drive infection risk (the type of patient-care location, the volume of device exposure, and, for surgical infections, procedure and patient characteristics), so that the ratio measures how a hospital performed relative to expectation for a hospital handling that case mix. This is what makes the SIR the right number for comparing a quaternary referral center to a rural hospital: both are scored against their own predicted baseline, not against each other's raw counts.

The denominator of exposure—distinct from the SIR's predicted-count denominator—is the other quantity the dataset carries, and it matters for interpretation. For CLABSI and CAUTI the exposure base is device-days: the sum, across patients, of the days each spent with a central line or a urinary catheter. For the surgical-site measures it is the number of procedures. Device-days are the honest denominator for device-associated infections because they capture how much opportunity for infection a hospital actually created—a hospital that removes lines promptly accrues fewer device-days and, all else equal, fewer infections. The raw observed count, the predicted count, and the exposure denominator are all published alongside the SIR precisely so that an analyst can reconstruct the ratio, judge how stable it is, and avoid the trap of reading a low SIR built on a tiny denominator as a strong result—the subject the caveats return to.
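The difference between the two denominators is easiest to see side by side. The sketch below uses two hypothetical hospitals (the figures are invented for illustration and are not drawn from the CMS file) that share the same crude rate per 1,000 device-days but receive different SIRs because the baseline predicts different counts for them.

```python
import pandas as pd

# Hypothetical figures for illustration only; not drawn from the CMS file.
hospitals = pd.DataFrame({
    "hospital": ["Large referral center", "Small community hospital"],
    "observed_infections": [24, 2],
    "predicted_infections": [30.0, 1.6],
    "number_of_device_days": [30000, 2500],
})

# Crude rate: infections per 1,000 device-days (the exposure denominator).
hospitals["rate_per_1000_device_days"] = (
    1000 * hospitals["observed_infections"] / hospitals["number_of_device_days"]
)

# SIR: observed over predicted (the risk-adjusted denominator).
hospitals["sir"] = (
    hospitals["observed_infections"] / hospitals["predicted_infections"]
)

print(hospitals.to_string(index=False))
```

In this made-up case the raw counts differ twelvefold and the crude rates are identical, yet only the SIR reflects that the baseline expected more infections per device-day in the referral center's patient-care locations.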

The money: HAC Reduction and value-based purchasing

These infection numbers are not merely informational. They carry real financial weight, because CMS has wired patient-safety measurement directly into how it pays hospitals, and the HAI measures feed two of the most consequential Medicare payment-adjustment programs. This is what elevates the dataset from a consumer scorecard to a record with money attached to every row.

The most pointed is the Hospital-Acquired Condition (HAC) Reduction Program. Established under the Affordable Care Act, the HAC Reduction Program identifies the worst-performing quartile of hospitals on a composite of patient-safety measures—the HAI measures prominent among them—and reduces all of their Medicare inpatient payments by one percent for the fiscal year. The structure is deliberately and unusually punitive: it is a relative penalty, not an absolute threshold, so a fixed share of hospitals is penalized every year regardless of how much infections improve nationally, and the penalty applies to every Medicare discharge, not just the cases where an infection occurred. A one-percent cut to total Medicare inpatient revenue is, for a large hospital, millions of dollars, which is why hospital leadership watches its HAI SIRs with an attention that goes well beyond public relations—a bad year on these measures can land a hospital in the penalized quartile and cost it real money.
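The relative-penalty mechanics described above can be sketched in a few lines. This is a simplified illustration, not the program's actual scoring method: the composite_score column, the facility IDs, and the payment figures are all hypothetical, and the real composite is built from several measures in a more involved way.

```python
import pandas as pd

# Hypothetical facility IDs, composite scores (higher = worse), and Medicare
# inpatient payments. None of these values come from CMS data.
hospitals = pd.DataFrame({
    "facility_id": ["000001", "000002", "000003", "000004",
                    "000005", "000006", "000007", "000008"],
    "composite_score": [-0.9, -0.4, -0.1, 0.0, 0.2, 0.5, 0.8, 1.3],
    "medicare_inpatient_payments": [40e6, 25e6, 60e6, 15e6,
                                    90e6, 30e6, 120e6, 55e6],
})

# Relative penalty: the cutoff is the 75th percentile of this year's scores,
# so about a quarter of hospitals is flagged however good the scores are overall.
cutoff = hospitals["composite_score"].quantile(0.75)
hospitals["penalized"] = hospitals["composite_score"] > cutoff

# One percent of all Medicare inpatient payments, not only infection cases.
hospitals["penalty_dollars"] = (
    0.01 * hospitals["medicare_inpatient_payments"]
).where(hospitals["penalized"], 0.0)

print(f"cutoff score: {cutoff:.3f}")
print(hospitals[hospitals["penalized"]].to_string(index=False))
```

Because the cutoff moves with the distribution, shifting every score in this table by the same amount would leave the same hospitals flagged.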

The HAI measures also factor into the Hospital Value-Based Purchasing (VBP) program, which works in the opposite direction from the HAC penalty's pure stick. VBP withholds a percentage of each participating hospital's Medicare payments and redistributes the pool as incentive payments based on performance across several domains—including a safety domain that incorporates HAI measures—rewarding hospitals for both high achievement and demonstrated improvement. A hospital can therefore see the same infection performance reflected in two payment programs at once: as a potential one-percent HAC penalty if it falls into the worst quartile, and as a gain or loss in the value-based purchasing reconciliation. The cumulative effect is that an SIR is not just a published ratio; it is an input to the Medicare payment a hospital receives, which is the strongest possible reason both to take the numbers seriously and to scrutinize how they are constructed.

Joining by CCN to the rest of the hospital record

The HAI table is most valuable not in isolation but as one facet of the integrated hospital record, and the facility_id—the CCN—is the universal join key that makes the integration possible. Three joins matter most, and each turns an infection ratio into something an analyst can reason about in context.

The first is to the broader hospital quality and outcomes record—the readmissions, mortality, complications, and overall star rating CMS publishes for the same hospitals on Care Compare. Joining HAI to these by CCN lets an analyst ask whether infection performance tracks with other dimensions of quality: do hospitals with high CLABSI or C. diff SIRs also show elevated mortality or readmissions, or is infection control a distinct competency that does not move with the rest? The second join is to the hospital's structural and financial profile—its ownership type, bed size, teaching status, and the CMS cost reports—which supplies the context needed to interpret an SIR responsibly. Infection performance plausibly differs between for-profit and non-profit systems, between teaching and community hospitals, and between large and small facilities, and only the structural join lets an analyst control for those factors rather than confound them.

The third join is to ownership and corporate structure. Because the CCN ties a hospital to CMS's ownership records, an analyst can roll individual-hospital SIRs up to the health-system or corporate-owner level and ask whether infection performance is a property of management rather than of any single facility, that is, whether the hospitals under one owner cluster together on the SIR distribution. This system-level view is exactly the kind of analysis that a single hospital's published ratio cannot support and that the CCN join makes routine. In every case the principle is the same: the SIR is a standardized, comparable number precisely so that it can be joined and aggregated across the hospital universe, and the CCN is what unlocks that comparability.
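A practical hazard with these joins is that the CCN is text, not a number: if one table was parsed numerically, leading zeros disappear and rows silently fail to match. The sketch below shows one way to guard against that with pandas. The two frames, the ownership_type values, and the normalize_ccn helper are hypothetical stand-ins for the HAI table and a hospital-profile table.

```python
import pandas as pd


def normalize_ccn(series):
    """Treat the CCN as text and restore leading zeros lost to numeric parsing."""
    return series.astype(str).str.strip().str.zfill(6)


# Hypothetical rows standing in for the HAI table and a hospital-profile table.
hai = pd.DataFrame({
    "facility_id": ["000123", "000456", "000789"],
    "measure_id": ["HAI_1_SIR"] * 3,
    "score": [0.62, 1.41, 0.98],
})
profile = pd.DataFrame({
    "facility_id": [123, 456],  # leading zeros lost upstream
    "ownership_type": ["Voluntary non-profit", "Proprietary"],
})

hai["facility_id"] = normalize_ccn(hai["facility_id"])
profile["facility_id"] = normalize_ccn(profile["facility_id"])

# Left join keeps every HAI row; the indicator column shows which rows matched,
# and validate raises if the profile table has duplicate CCNs.
joined = hai.merge(profile, on="facility_id", how="left",
                   validate="many_to_one", indicator=True)
print(joined.to_string(index=False))
```

Checking the _merge column before analysis makes unmatched hospitals visible instead of letting them drop out of a later groupby.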

Analytical uses

A national, hospital-resolved, risk-adjusted record of healthcare-associated infections supports a distinctive set of analyses that no single hospital's scorecard can.

Ranking and benchmarking hospitals is the most immediate use. Because every hospital is scored on the same measures against the same baseline, an analyst can rank hospitals within a state, a metro area, or a peer group by SIR for any measure, identify the consistent high performers and the persistent outliers, and track whether a given hospital's SIR is trending up or down across reporting periods. The published compared_to_national category provides a built-in sanity check on any such ranking. Geographic and state-level analysis rolls the hospital-level ratios up: computing the share of hospitals in each state with an SIR above 1.0, or the state median SIR by measure, surfaces regional patterns in infection control and lets an analyst relate them to state policy, reporting practice, and the structure of each state's hospital market.

Predicting HAC penalty exposure exploits the link between the measures and the money: because the HAC Reduction Program penalizes the worst-performing quartile, an analyst can use the HAI SIRs—alongside the other patient-safety measures in the composite—to estimate which hospitals are at risk of the one-percent penalty before CMS publishes the final determinations, information of real value to hospital finance and quality teams. Finally, studying ownership and structural correlates of infection performance brings the CCN joins to bear: combining SIRs with ownership type, bed size, teaching status, and corporate parent reveals whether infection control is associated with how a hospital is owned and run, the kind of system-level question that bears directly on policy and that only the integrated, hospital-resolved record can answer.

Python workflow: HAI data from the CMS Provider Data Catalog

The script below pulls the Healthcare Associated Infections - Hospital dataset from CMS's data.cms.gov Provider Data Catalog, filters to the central-line bloodstream infection (CLABSI) Standardized Infection Ratio measure, ranks hospitals by SIR, and computes by state the share of hospitals with an SIR above 1.0, meaning more infections than the national baseline predicts. That point-estimate cut is not the same thing as CMS's own worse-than-benchmark category, which also accounts for statistical uncertainty, so the script prints the published category as a cross-check. No API key is required for public data. Because the published CSV lives at a hashed, timestamped path that changes with every release, the script resolves the current download URL at runtime from the Provider Data Catalog metastore API rather than hard-coding it, and resolves the column names defensively through a small helper; any production use should be validated against the current dataset metadata and should confirm the measure-ID codes for the release in hand. Requirements: requests and pandas (numpy is installed with pandas).

import requests
import pandas as pd

# CMS Provider Data Catalog -- Healthcare Associated Infections - Hospital
# Sourced from the CDC National Healthcare Safety Network (NHSN) and
# published on Care Compare. No API key required for public data.
# Dataset landing page: https://data.cms.gov/provider-data/dataset/77hc-ibv8
#
# The published CSV lives at a hashed, timestamped path that changes with
# every release, so it must NOT be hard-coded. Instead, resolve the current
# download URL at runtime from the Provider Data Catalog metastore API.
DATASET_ID = "77hc-ibv8"
META_URL = (
    "https://data.cms.gov/provider-data/api/1/metastore/"
    f"schemas/dataset/items/{DATASET_ID}?show-reference-ids"
)

print("Resolving current CSV download URL from the metastore...")
meta = requests.get(META_URL, timeout=60)
meta.raise_for_status()
dist = meta.json()["distribution"]
# The distribution may store the URL under .data.downloadURL or .downloadURL.
csv_url = None
for d in dist:
    node = d.get("data", d)
    url = node.get("downloadURL")
    if url and url.lower().endswith(".csv"):
        csv_url = url
        break
if not csv_url:
    raise RuntimeError("No CSV downloadURL found in dataset metadata")
print(f"CSV: {csv_url}")

df = pd.read_csv(csv_url, dtype=str, low_memory=False)
print(f"Loaded {len(df):,} hospital-by-measure rows")

# Resolve column names defensively -- they vary slightly by release.
def col(frame, *cands):
    low = {c.lower(): c for c in frame.columns}
    for cand in cands:
        if cand.lower() in low:
            return low[cand.lower()]
    raise KeyError(f"none of {cands} present")

c_ccn   = col(df, "Facility ID", "facility_id", "Provider ID")
c_state = col(df, "State", "state")
c_meas  = col(df, "Measure ID", "measure_id")
c_score = col(df, "Score", "score")
c_cmp   = col(df, "Compared to National", "compared_to_national")

# The SIR rows for each measure carry the suffix "_SIR" in Measure ID.
# Pick the CLABSI standardized infection ratio measure.
MEASURE = "HAI_1_SIR"   # central-line-associated bloodstream infection SIR
sir = df[df[c_meas] == MEASURE].copy()
sir["sir"] = pd.to_numeric(sir[c_score], errors="coerce")
sir = sir.dropna(subset=["sir"])
print(f"{MEASURE}: {len(sir):,} hospitals with a numeric SIR")

# --- 1. Rank hospitals by SIR (worst infection performance first) ------
worst = sir.sort_values("sir", ascending=False).head(15)
print("\nHospitals with the highest CLABSI SIR:")
for _, r in worst.iterrows():
    print(f"  {r[c_ccn]}  {r[c_state]}  SIR={r['sir']:.3f}")

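# --- 2. Share of hospitals with an SIR above 1.0, by state -------------
# An SIR above 1.0 means more infections than the national baseline
# predicts for a hospital of that size and case mix. This is a point-estimate
# cut; the CMS "Compared to National" category also reflects statistical
# uncertainty, so the two will not always agree.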
sir["worse"] = sir["sir"] > 1.0
by_state = (
    sir.groupby(c_state)
    .agg(hospitals=("sir", "size"),
         worse=("worse", "sum"),
         median_sir=("sir", "median"))
    .reset_index()
)
by_state = by_state[by_state["hospitals"] >= 10]
by_state["pct_worse"] = (100 * by_state["worse"] / by_state["hospitals"]).round(1)
by_state = by_state.sort_values("pct_worse", ascending=False)
print("\nShare of hospitals with SIR > 1.0 by state (>=10 hospitals):")
print(by_state[[c_state, "hospitals", "pct_worse", "median_sir"]].head(15).to_string(index=False))

# --- 3. Use the CMS "Compared to National" column as a cross-check ------
print("\nCMS comparison-to-national distribution:")
print(sir[c_cmp].fillna("(not available)").value_counts().to_string())

Two practical notes apply. First, the dataset mixes measure types in the measure_id column: for each infection it carries not only the SIR row but separate rows for the observed-count numerator, the predicted-count denominator, and the device-days or procedures. Filtering to the right measure code is therefore essential, and an analyst who wants to reproduce the ratio rather than trust the published score should pull the observed and predicted rows and divide them directly. Second, the state-level “share worse than national” calculation deliberately drops states with fewer than ten reporting hospitals, because a percentage computed over a handful of facilities is noise; the same instinct should be applied to individual hospitals, where a low SIR resting on a tiny device-day denominator is far less meaningful than the same SIR built on a large one. The published footnotes flag the suppressed and too-few-cases rows that the numeric coercion silently turns into missing values, and a careful analysis reads those footnotes rather than treating an absent score as a clean result.
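Reproducing the ratio from its components might look like the sketch below. It runs on a small hypothetical frame, not the live file, and the NUMERATOR and ELIGCASES suffixes are assumed names for the observed and predicted rows; confirm the actual codes against the release in hand before adapting it.

```python
import pandas as pd

# Hypothetical long-format rows. The NUMERATOR (observed) and ELIGCASES
# (predicted) suffixes are assumed names, not confirmed by this article.
rows = pd.DataFrame({
    "facility_id": ["000123"] * 3 + ["000456"] * 3,
    "measure_id": ["HAI_1_NUMERATOR", "HAI_1_ELIGCASES", "HAI_1_SIR"] * 2,
    "score": ["6", "8.0", "0.750", "3", "2.4", "1.250"],
})
rows["value"] = pd.to_numeric(rows["score"], errors="coerce")

# One row per hospital, one column per component.
wide = rows.pivot(index="facility_id", columns="measure_id", values="value")

# Recompute observed / predicted and compare with the published ratio.
wide["recomputed_sir"] = wide["HAI_1_NUMERATOR"] / wide["HAI_1_ELIGCASES"]

# Published scores are rounded, so compare with a tolerance, not equality.
wide["matches_published"] = (
    (wide["recomputed_sir"] - wide["HAI_1_SIR"]).abs() < 0.001
)
print(wide.to_string())
```

Note that pivot raises if a hospital has duplicate rows for the same measure, which is a useful check when a file covers more than one reporting period.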

Limitations and analytical caveats

The HAI dataset is the most comprehensive public record of hospital infection performance in the United States, but it carries structural limitations that an analyst must internalize before treating an SIR as a verdict on a hospital.

Risk adjustment is a model, not a fact. The predicted count in the SIR's denominator comes from the CDC's national baseline model, and like any model it captures the risk factors it includes and misses those it does not. A hospital that serves an unusually sick or socially complex population may have genuine infection risk that the model under-predicts, making its SIR look worse than its practice warrants; conversely, the model cannot fully equalize every difference in case mix. The SIR is a far better comparison than a raw rate, but it is risk-adjusted, not risk-free, and an analyst should resist the temptation to read small SIR differences between hospitals as precise rankings of competence.

Small denominators make SIRs unstable, and CMS suppresses them. A hospital with few central-line-days or few colon surgeries has a small predicted count, and over a small base a single infection—or its absence—swings the ratio wildly. An SIR of 0.0 from a hospital that performed twenty procedures and happened to have no infection is not evidence of excellence; it is evidence of a small sample. CMS handles this partly through suppression: rows with too few predicted infections to compute a reliable ratio are footnoted and carry no numeric score. Any ranking that ignores the denominator—or treats a suppressed row as a missing data point rather than as a deliberate signal that the number would be unreliable—will systematically reward small hospitals for their smallness.
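One way to make this instability concrete is to ask how far each hospital's ratio would move if it had recorded a single additional infection. The sketch below does that on hypothetical rows; the MIN_PREDICTED threshold is an assumption of this example and not a stated CMS rule, so it should be chosen deliberately for any real analysis.

```python
import pandas as pd

# Hypothetical hospitals: two report zero infections, but on very different
# predicted counts. None of these values come from CMS data.
sir = pd.DataFrame({
    "facility_id": ["000123", "000456", "000789"],
    "observed_infections": [0, 0, 9],
    "predicted_infections": [0.4, 12.0, 10.0],
})
sir["sir"] = sir["observed_infections"] / sir["predicted_infections"]

# How far would the ratio move with one more infection?
sir["sir_plus_one"] = (
    (sir["observed_infections"] + 1) / sir["predicted_infections"]
)
sir["swing"] = sir["sir_plus_one"] - sir["sir"]

# Illustrative stability screen; the threshold is an assumption of this sketch.
MIN_PREDICTED = 1.0
sir["stable_enough"] = sir["predicted_infections"] >= MIN_PREDICTED

print(sir.to_string(index=False))
```

The first two hospitals both show an SIR of 0.0, but one extra infection would move the first to 2.5 and the second to about 0.08, which is why ranking on the ratio alone rewards small denominators.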

The data depends on consistent surveillance, and reporting can vary. Because HAIs are identified by infection preventionists applying surveillance definitions, the count reflects how diligently and how uniformly those definitions are applied. A hospital with rigorous, well-staffed surveillance may detect and report more infections than a less diligent one—and thereby post a higher SIR—not because it is less safe but because it is looking harder. The standardized NHSN definitions are designed to minimize this, and validation audits exist to police it, but the residual possibility that variation in surveillance intensity contributes to variation in measured infections is a genuine confound, and it cautions against reading every difference between hospitals as a difference in actual safety.

There is reporting lag and a fixed baseline. The published measures cover defined reporting periods that close well before publication, so the data describes infection performance over a past window rather than the present, and the most recent quarters are not yet reflected. The SIR is also computed against a stated CDC baseline year; when the CDC rebaselines—as it does periodically to keep the comparison meaningful as national practice improves—SIRs from before and after the change are not directly comparable, because the same observed counts are being measured against a different yardstick. A longitudinal study of a hospital's SIR over many years must account for the baseline change rather than treat the series as a continuous measurement.

Held with these caveats in mind, the cms_hospital_infections table is a uniquely valuable resource: a hospital-resolved, risk-adjusted, payment-linked record of the infections that US hospitals give their own patients—roughly 173,000 hospital-by-measure rows in which the Standardized Infection Ratio turns a count of bloodstream infections, urinary-tract infections, surgical-site infections, MRSA cases, and C. diff cases into a comparable, accountable measure of how safe it is to be a patient, and one with real Medicare dollars riding on every number.

Related writing

CMS Hospital Quality Data: Outcomes, Readmissions, and Star Ratings for 6,000 US Hospitals — The broader Care Compare quality record that the infection measures join into by CCN, letting an analyst test whether a hospital's SIRs track with its readmissions, mortality, and overall star rating.

CMS Provider Ownership: The Federal Database Behind Private Equity in Nursing Homes, Home Health, and Hospice — The ownership records that turn individual-hospital infection ratios into a system-level question, revealing whether infection performance clusters by corporate owner rather than by facility.

CMS Post-Acute Care Utilization: The Federal Database Behind Home Health, Hospice, and Skilled Nursing Spending — A companion CMS dataset built on the same provider identifiers and Medicare payment logic, extending the federal performance record from the hospital stay into the post-acute care that follows it.