
CMS Hospital Service Area: The Federal Map of Where Every Hospital's Patients Come From

12 min read · AI Analytics
Tags: CMS, Hospital Markets, Medicare, Patient Flow, Federal Data

Every hospital has a gravitational field. Patients fall into it from the streets around the building, from the next town over, and from the rural counties an hour's drive away, and the field thins out the farther you go until, somewhere, it gives way to the pull of a competing hospital. CMS draws the map of that field for the whole country. The Hospital Service Area file reports, for every Medicare-certified hospital, the ZIP codes its Medicare patients come from and how many cases came from each. Its roughly 1.16 million hospital-by-ZIP rows are the federal answer to a deceptively simple question: where does each hospital's business actually come from?

This article covers what the Hospital Service Area file is and the hospital-by-patient-ZIP grain that defines it; how CMS builds it from Medicare fee-for-service inpatient claims and what the case, day, and charge measures each mean; the small-cell suppression that censors low-volume hospital-ZIP pairs and why that systematically thins out rural coverage; the concept of a catchment area and how to compute each hospital's top origin ZIPs and home-ZIP share; the use of the file to define geographic hospital markets and the central role it plays in hospital-merger antitrust review; the Dartmouth Atlas tradition of hospital service areas and referral regions that this data descends from; how the file joins to the Provider of Services file and the broader CMS hospital datasets through the CMS Certification Number; a Python workflow that pulls the file from the data.cms.gov API and computes a hospital catchment and a ZIP-level market-concentration measure; and the caveats—the Medicare-only lens, the suppression bias, and the difference between a patient-flow map and a true competitive market—that every analyst must hold in mind.

What the dataset is

The Hospital Service Area file is a public CMS dataset that describes, for every Medicare-certified hospital in the United States, the geographic origin of the patients it treats. It is, at bottom, a crosswalk: it pairs each hospital with each ZIP code that any of its patients lived in, and it records how much of the hospital's volume came from that ZIP. It does not describe what happens inside the hospital—not quality, not outcomes, not the conditions treated—but rather the spatial relationship between hospitals and the populations they serve. In the language of health-services research, it is a patient-flow dataset: it traces the flow of patients from where they live to where they are admitted.

In our database this is the table cms_hospital_service_area, and its grain is the single most important thing to understand about it: one row per hospital per patient ZIP code. A large urban hospital that draws Medicare patients from four hundred different ZIP codes contributes four hundred rows; a small rural hospital that draws from a dozen contributes a dozen. Across all hospitals and all the ZIP codes that feed them, the file comes to roughly 1.16 million rows. Each row answers, for one hospital-and-ZIP combination, three quantitative questions—how many inpatient cases (discharges) from that ZIP went to that hospital, how many inpatient days those cases ran, and what they were charged—keyed to the hospital by its CMS Certification Number:

medicare_prov_num     -- CMS Certification Number (CCN) of the hospital
zip_cd_of_residence   -- patient ZIP code of residence (the origin)
total_cases           -- number of inpatient cases (discharges) from the ZIP
total_days            -- total inpatient days of care for those cases
total_charges         -- total submitted charges for those cases (dollars)
                      --   total_cases and total_charges are suppressed
                      --   when total_cases is fewer than 11, for privacy

Two columns are the keys. The medicare_prov_num is the CMS Certification Number (CCN), the persistent six-character identifier CMS assigns to every facility it certifies to participate in Medicare; it is the same identifier used across the hospital cost reports, the quality datasets, and the Provider of Services file, which is what makes the Hospital Service Area file joinable to everything else CMS knows about the hospital. The zip_cd_of_residence is the patient's ZIP code of residence—the origin of the flow, not the hospital's location. The remaining columns are the measures: total cases counts inpatient discharges, total days sums the days of care across those cases, and total charges is the dollar amount the hospital submitted (gross charges, not what Medicare actually paid). Note what the file does not carry: there is no separate count of distinct beneficiaries, so the file measures admissions (cases), not people. One chronically ill patient admitted three times in a year contributes three cases, and the file has no way to tell that apart from three different patients admitted once each—cases are episodes, not individuals.
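As a minimal sketch of what that grain means in practice, the toy frame below reuses the column names listed above with invented values. The `*` marker for a suppressed cell is an assumption about how a given release encodes suppression, so check the release's data dictionary before relying on it.

```python
import pandas as pd

# Toy rows using the column names above; every value is invented.
# "*" stands in for a suppressed cell (an assumed encoding).
hsa = pd.DataFrame({
    "medicare_prov_num":   ["010001", "010001", "010001", "010005"],
    "zip_cd_of_residence": ["36301", "36303", "36305", "36301"],
    "total_cases":         ["412", "187", "*", "23"],
    "total_days":          ["1980", "902", "31", "96"],
    "total_charges":       ["9100000", "4050000", "*", "610000"],
})

# The grain: (hospital, patient ZIP) should identify a row uniquely.
key = ["medicare_prov_num", "zip_cd_of_residence"]
grain_ok = not hsa.duplicated(subset=key).any()

# Coerce the measures to numbers. Suppressed cells become NaN rather
# than zero, so "unknown" stays distinguishable from "none".
for col in ["total_cases", "total_days", "total_charges"]:
    hsa[col] = pd.to_numeric(hsa[col], errors="coerce")

rows_per_hospital = hsa.groupby("medicare_prov_num").size()
print(grain_ok, rows_per_hospital.to_dict())
```

Keeping the CCN and ZIP as strings is deliberate: both can begin with a zero, and a numeric type would silently drop it.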

How it is built: Medicare claims aggregated to a grain

The Hospital Service Area file is not collected from hospitals as a survey—it is derived from claims. Specifically, it is built from Medicare fee-for-service inpatient claims: the bills that hospitals submit to Medicare for the care of beneficiaries enrolled in traditional, non-managed-care Medicare. Each inpatient claim already carries the two facts the file needs: the CCN of the hospital that submitted it, and the beneficiary's ZIP code of residence. CMS takes a calendar year's worth of those claims and aggregates them up—collapsing millions of individual admissions into the hospital-by-ZIP summary—counting the discharges, and summing the days of care and the charges in each hospital-and-ZIP cell.
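The aggregation described above can be sketched on a handful of invented claim rows. The claim-level column names (`ccn`, `zip`, `days`, `charges`) are hypothetical and stand in for whatever the underlying claims extract actually uses; this illustrates the collapse to the hospital-by-ZIP grain, not CMS's production logic.

```python
import pandas as pd

# Hypothetical claim-level rows: one row per inpatient discharge.
claims = pd.DataFrame({
    "ccn":     ["010001", "010001", "010001", "010005"],
    "zip":     ["36301", "36301", "36303", "36301"],
    "days":    [4, 6, 3, 5],
    "charges": [21000.0, 35500.0, 18250.0, 27000.0],
})

# Collapse to one row per (hospital, patient ZIP): count the discharges,
# then sum the days of care and the submitted charges in each cell.
hsa = (
    claims.groupby(["ccn", "zip"], as_index=False)
          .agg(total_cases=("days", "size"),
               total_days=("days", "sum"),
               total_charges=("charges", "sum"))
)
print(hsa)
```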

The reliance on fee-for-service claims is the dataset's defining methodological characteristic, and it cuts two ways. On one hand it is what makes the file possible and authoritative: claims are the exhaust of the payment system, so every reimbursed inpatient admission is captured without the file having to ask anyone for it. On the other hand it is the source of the file's biggest blind spot. A large and growing share of Medicare beneficiaries are enrolled not in traditional fee-for-service Medicare but in Medicare Advantage, the privately administered managed care alternative. Medicare Advantage encounters do not flow through the fee-for-service claim stream in the same way, so beneficiaries in those plans are largely absent from the Hospital Service Area file. The map the file draws is therefore the patient flow of fee-for-service Medicare specifically—a large, important, but partial slice of each hospital's actual patient population, and one whose representativeness varies geographically as Medicare Advantage penetration varies from market to market.

One further consequence of building the file from claims is that it inherits the claim's notion of geography. The patient's ZIP of residence is the ZIP recorded on the beneficiary's enrollment record at the time of the claim—a snapshot that may lag a recent move, and that is a ZIP, not a precise address. ZIP codes are not uniform in size or population; a single rural ZIP can span an enormous, sparsely populated area, while a dense urban ZIP covers a few blocks. Any spatial analysis built on the file—travel distance, market boundaries, access—is therefore working at ZIP resolution, with all the coarseness and irregularity that the ZIP geography imposes.

Small-cell suppression and the rural blind spot

The single most important data-handling fact about the Hospital Service Area file is that it is suppressed to protect patient privacy. Because the file is built from individual beneficiaries' claims and reported at a fine hospital-by-ZIP grain, a cell with very few patients could, in principle, be combined with other public information to re-identify an individual—the one person from a tiny rural ZIP who was admitted to a particular hospital. To prevent that, CMS applies small-cell suppression: in the current rule it censors the total cases and total charges for any hospital-ZIP row whose case count is fewer than eleven, so that no hospital-ZIP pair representing a handful of identifiable admissions is published with usable counts.

This privacy protection is necessary and correct, but it imposes a structural bias that every analyst must internalize: suppression is not random—it is concentrated exactly where volumes are low. The hospital-ZIP pairs most likely to be suppressed are the small ones: a few patients from a distant rural ZIP traveling to a regional referral center, the long thin tail of a hospital's catchment, the entire patient flow of a tiny critical-access hospital in a sparsely populated county. The result is that rural and low-volume hospital-ZIP relationships are systematically missing from the file. The dense, high-volume core of each hospital's catchment—the nearby ZIPs that send dozens or hundreds of cases—is reported in full; the sparse periphery is censored away.

The analytic implications are real and easy to get wrong. Summing a hospital's reported cases will understate its true volume, because the suppressed tail is not counted. Measuring how far patients travel will understate distance, because it is precisely the far, thin, long-distance flows that get suppressed. Mapping a hospital's catchment will draw it too tight, missing the wide rural reach of referral centers. And any analysis that compares urban and rural hospitals, or that tries to characterize access in sparsely populated areas, runs directly into the fact that the rural data is the data most likely to be absent. The suppression does not make the file wrong; it makes it incomplete in a predictable direction, and analyses must be built to respect that direction rather than treating the published cells as the whole story.
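To make the direction of the bias concrete, the sketch below applies the fewer-than-eleven rule to an invented "true" catchment and compares totals and travel distance before and after. The ZIP labels, distances, and case counts are all made up for illustration.

```python
import pandas as pd

THRESHOLD = 11  # cells with fewer than 11 cases are suppressed

# Invented "true" catchment for one hospital.
true_flows = pd.DataFrame({
    "zip":   ["A", "B", "C", "D", "E"],
    "miles": [2, 5, 12, 40, 75],
    "cases": [300, 120, 40, 8, 4],
})

# What an analyst would actually see after suppression.
published = true_flows[true_flows["cases"] >= THRESHOLD]

def mean_miles(df):
    # Case-weighted average travel distance.
    return (df["miles"] * df["cases"]).sum() / df["cases"].sum()

true_total = int(true_flows["cases"].sum())
seen_total = int(published["cases"].sum())
true_mean, seen_mean = mean_miles(true_flows), mean_miles(published)

print(f"cases: true {true_total}, published {seen_total}")
print(f"mean miles: true {true_mean:.2f}, published {seen_mean:.2f}")
print(f"farthest ZIP: true {true_flows['miles'].max()} mi, "
      f"published {published['miles'].max()} mi")
```

In this toy case the published view loses only a small share of the cases but most of the catchment's geographic reach, which is the pattern the paragraph above describes.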

Catchment areas: top origin ZIPs and home-ZIP share

The most natural use of the file is to describe each hospital's catchment area—the geographic territory from which it draws its patients. Because each row carries a hospital, a patient ZIP, and a volume, a hospital's catchment is simply the set of ZIPs that appear with it, weighted by how many cases each contributes. Sorting a hospital's ZIPs by case count and reading off the top of the list gives its top origin ZIPs: the neighborhoods and towns that send it the most patients, which are almost always the ZIPs nearest the building.

Two simple summary statistics capture most of what is interesting about a catchment. The first is the home-ZIP share—the fraction of a hospital's total cases that come from the single ZIP in which the hospital itself sits, or more usefully from a small ring of immediately surrounding ZIPs. A high home-ZIP share describes a local, neighborhood hospital whose business is overwhelmingly the people next door; a low home-ZIP share, with volume spread thinly across many distant ZIPs, describes a regional referral center or a specialized facility that pulls patients from across a wide area. The second is the concentration of the catchment: how few ZIPs it takes to account for, say, half or three-quarters of a hospital's cases. A community hospital might reach seventy-five percent of its volume in five or ten ZIPs; an academic medical center might need fifty. Together these statistics turn the raw rows into a compact characterization of a hospital's spatial role—local workhorse versus regional magnet—that can be compared across thousands of facilities.
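Both statistics take only a few lines to compute. The sketch below is one possible formulation: the function name `catchment_summary` and the short column names are hypothetical, and the hospital's own ZIP is passed in as an argument because the service-area file does not carry it.

```python
import pandas as pd

def catchment_summary(rows, home_zip, target=0.75):
    """Return (home-ZIP share, number of top ZIPs needed to reach target)."""
    g = rows.sort_values("total_cases", ascending=False)
    total = g["total_cases"].sum()
    if total <= 0:
        raise ValueError("no reported cases for this hospital")
    home_share = g.loc[g["zip"] == home_zip, "total_cases"].sum() / total
    # Cumulative share after each ZIP, largest first. The number of ZIPs
    # still below the target, plus one, is the count needed to reach it.
    cum_share = g["total_cases"].cumsum() / total
    zips_needed = int((cum_share < target).sum()) + 1
    return float(home_share), zips_needed

# Invented rows for one hospital.
rows = pd.DataFrame({
    "zip":         ["36301", "36303", "36305", "36310", "36345"],
    "total_cases": [500, 200, 150, 100, 50],
})
home_share, zips_needed = catchment_summary(rows, home_zip="36301")
print(f"home-ZIP share {home_share:.0%}; "
      f"{zips_needed} ZIPs reach 75% of cases")
```

Because suppressed cells are missing from the denominator, both numbers describe reported volume only and will make a catchment look somewhat more concentrated than it is.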

The same data read in the other direction—fixing a ZIP and looking at the hospitals its residents go to—answers the patient's question rather than the hospital's: where do the people of this place actually get their inpatient care, and how is their volume split among competing hospitals? That inversion is the foundation of both the market-definition use that follows and the Dartmouth Atlas tradition of building service areas from the ground up by assigning each ZIP to the hospital its residents use most.

Defining hospital markets and antitrust review

The highest-stakes use of the Hospital Service Area file is defining the geographic market for a hospital, and the setting where that definition matters most is antitrust review of hospital mergers. When two hospitals propose to merge, the central legal question is whether the combination would substantially lessen competition—and that question cannot be answered without first defining the relevant market: the geographic area within which the merging hospitals actually compete for patients. Define the market too narrowly and a harmful merger looks benign; define it too broadly and a benign one looks harmful. The patient-flow data is the empirical anchor that disciplines the definition.

The reasoning runs through patient origins. If two hospitals draw their patients from overlapping sets of ZIP codes, they are competing for the same population and a merger combines two competitors; if they draw from largely disjoint areas, they may not meaningfully compete at all. Antitrust economists use the patient-flow data to construct geographic markets empirically—classically by methods in the lineage of the Elzinga-Hogarty approach, which examines how much of a candidate market's patients are treated outside it and how many outside patients flow in, to test whether the area is a self-contained market. More recent merger analysis has moved beyond simple flow tests to demand-estimation models, but in every approach the Hospital Service Area file's hospital-by-ZIP volumes are a primary input: they are the raw evidence of who competes with whom for which neighborhoods.
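A flow test in this spirit can be sketched as two ratios for a candidate market: the share of residents' cases treated outside it, and the share of in-market hospitals' cases that come from outside it. The function name `flow_shares`, the data, and the candidate market are all invented, and this simplified version is not a reproduction of any agency's or court's method.

```python
import pandas as pd

# Invented hospital-by-ZIP flows.
flows = pd.DataFrame({
    "ccn":   ["H1", "H1", "H2", "H2", "H3", "H3"],
    "zip":   ["Z1", "Z2", "Z1", "Z3", "Z2", "Z3"],
    "cases": [80, 30, 20, 10, 15, 90],
})

def flow_shares(flows, market_zips, market_hospitals):
    in_zip = flows["zip"].isin(market_zips)
    in_hosp = flows["ccn"].isin(market_hospitals)
    resident_cases = flows.loc[in_zip, "cases"].sum()
    hospital_cases = flows.loc[in_hosp, "cases"].sum()
    # Outflow: market residents treated at hospitals outside the market.
    outflow = flows.loc[in_zip & ~in_hosp, "cases"].sum() / resident_cases
    # Inflow: in-market hospitals' cases drawn from outside ZIPs.
    inflow = flows.loc[~in_zip & in_hosp, "cases"].sum() / hospital_cases
    return float(outflow), float(inflow)

outflow, inflow = flow_shares(flows, {"Z1", "Z2"}, {"H1", "H2"})
print(f"outflow share {outflow:.1%}, inflow share {inflow:.1%}")
```

Low values on both ratios suggest a candidate area that is close to self-contained. How low is low enough is a methodological choice that varies by analysis, and the ratios are computed on reported cells only.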

Once a market is defined, its competitiveness is summarized with a concentration measure, most commonly the Herfindahl-Hirschman Index (HHI): the sum of the squared percentage market shares of the hospitals in the market, which ranges from near zero (many small competitors) to ten thousand (a monopoly). Computed from patient-flow shares, the HHI is the number that tells a regulator whether a market is already concentrated and how much a proposed merger would concentrate it further. The 2010 federal Horizontal Merger Guidelines treated markets above twenty-five hundred as highly concentrated, and the 2023 Merger Guidelines lowered that line to eighteen hundred, so any threshold should be cited together with the guideline it comes from. The Hospital Service Area file is the data behind these numbers, not only in the courtroom but in the academic literature documenting the decades-long rise in hospital-market concentration and its association with higher prices. The same file that a hospital's strategy office uses to plan a new clinic is the file an antitrust economist uses to argue that the hospital should not be allowed to buy its rival.
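A short worked example shows the arithmetic. The four hospitals and their shares are invented; the point is only how the index and a merger's effect on it are calculated.

```python
def hhi(shares_pct):
    """Sum of squared market shares, with shares given in percent."""
    return sum(s ** 2 for s in shares_pct)

# Invented market: four hospitals with 40%, 30%, 20%, and 10% of cases.
pre = {"A": 40.0, "B": 30.0, "C": 20.0, "D": 10.0}
pre_hhi = hhi(pre.values())    # 1600 + 900 + 400 + 100 = 3000

# Hospitals B and C merge, so their shares combine.
post = {"A": 40.0, "B+C": 50.0, "D": 10.0}
post_hhi = hhi(post.values())  # 1600 + 2500 + 100 = 4200

delta = post_hhi - pre_hhi     # 1200, which equals 2 * 30 * 20
print(pre_hhi, post_hhi, delta)
```

The change in HHI from merging two firms is always twice the product of their shares, which is why a combination of two mid-sized hospitals can move the index more than the acquisition of a small one by a large one.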

The Dartmouth Atlas tradition

The intellectual lineage of this file runs through the Dartmouth Atlas of Health Care, the body of work that, beginning in the 1990s, used Medicare claims to map how medical care is delivered across the United States and to document the vast, unexplained geographic variation in how much care different regions consume. The Dartmouth project needed units of geography that reflected how patients actually use hospitals rather than arbitrary administrative boundaries like counties or states—and so it built those units from patient-flow data of exactly the kind the Hospital Service Area file contains.

Two Dartmouth constructs became standard vocabulary. A hospital service area (HSA) is a local region built by assigning each ZIP code to the hospital (or cluster of hospitals) where its residents receive most of their inpatient care—a bottom-up, patient-revealed definition of a local hospital market. A hospital referral region (HRR) is a larger region, aggregating HSAs, defined by where patients go for major tertiary care such as cardiovascular surgery and neurosurgery—the catchment of the big referral centers. These regions are not drawn on a map by a committee; they are discovered from the data, by following patients from their ZIP of residence to the hospital they chose. The Hospital Service Area file is the federal, refreshed-annually expression of precisely that methodology—the same hospital-to-patient-ZIP flows that let the Dartmouth researchers carve the country into service areas and referral regions, now published directly as a CMS dataset that any analyst can rebuild those geographies from.
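The core assignment step, giving each ZIP to the hospital its residents use most, can be sketched as follows. This is a deliberately simplified plurality rule on invented data. It is not the Dartmouth Atlas methodology itself, which involves further adjustments that this sketch does not attempt.

```python
import pandas as pd

# Invented hospital-by-ZIP flows.
flows = pd.DataFrame({
    "ccn":   ["H1", "H2", "H1", "H3", "H2", "H3"],
    "zip":   ["Z1", "Z1", "Z2", "Z2", "Z3", "Z3"],
    "cases": [80, 20, 30, 15, 10, 90],
})

# For each ZIP, keep the row of the hospital with the most cases.
# Ties go to whichever row appears first.
top = flows.loc[flows.groupby("zip")["cases"].idxmax(),
                ["zip", "ccn", "cases"]]

# Share of the ZIP's reported cases captured by that hospital.
zip_totals = flows.groupby("zip")["cases"].sum()
top = top.assign(
    share=top["cases"].to_numpy() / zip_totals.loc[top["zip"]].to_numpy()
)

# Group the assigned ZIPs under each hospital to form service areas.
service_areas = top.groupby("ccn")["zip"].apply(list).to_dict()
print(top.to_string(index=False))
print(service_areas)
```

The `share` column is worth keeping: a ZIP assigned to a hospital on a thin plurality is a very different thing from one where that hospital takes nearly all the cases.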

Joining to the CCN and the broader CMS hospital data

The Hospital Service Area file is thin by design—it carries patient origins and volumes, and almost nothing about the hospital itself. Its value multiplies when it is joined, on the CMS Certification Number, to the other CMS datasets that describe what each hospital is and how well it performs. The CCN is the universal key, and it unlocks several joins.

The first join is to the Provider of Services file and the hospital enrollment data, which supply each CCN's name, street address, ownership type, bed count, and the categorical facts—is this a short-term acute-care hospital, a critical-access hospital, a children's hospital, a psychiatric facility—needed to interpret a catchment. Knowing that a CCN with an enormous, far-flung catchment is an academic medical center, while one with a tight local catchment is a fifteen-bed critical-access hospital, is what turns a flow pattern into an understanding of the hospital's role. Crucially, the Provider of Services file also supplies the hospital's own location, which the Hospital Service Area file itself does not carry—and the hospital's ZIP is what lets an analyst compute the distance from each patient ZIP to the hospital, the foundation of every travel-distance and access study.
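The distance calculation can be sketched as two joins and a great-circle formula. Everything here is illustrative: the `pos` frame stands in for a Provider of Services extract reduced to CCN and hospital ZIP, the `centroids` frame stands in for a ZIP centroid table from some outside source, and the coordinates are invented.

```python
import math
import pandas as pd

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two lat/lon points."""
    r = 3958.8  # mean Earth radius in miles
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

# Invented inputs: flows, hospital ZIPs, and ZIP centroids.
hsa = pd.DataFrame({"ccn": ["H1", "H1"], "zip": ["Z1", "Z2"],
                    "cases": [100, 20]})
pos = pd.DataFrame({"ccn": ["H1"], "hosp_zip": ["Z1"]})
centroids = pd.DataFrame({"zip": ["Z1", "Z2"],
                          "lat": [40.0, 41.0], "lon": [-90.0, -90.0]})

# Attach the hospital's ZIP, then coordinates for both ends of the flow.
hosp_xy = centroids.rename(
    columns={"zip": "hosp_zip", "lat": "hosp_lat", "lon": "hosp_lon"})
df = (hsa.merge(pos, on="ccn", how="left")
         .merge(centroids, on="zip", how="left")
         .merge(hosp_xy, on="hosp_zip", how="left"))

df["miles"] = [
    haversine_miles(a, b, c, d)
    for a, b, c, d in zip(df["lat"], df["lon"],
                          df["hosp_lat"], df["hosp_lon"])
]
weighted_miles = (df["miles"] * df["cases"]).sum() / df["cases"].sum()
print(df[["ccn", "zip", "cases", "miles"]])
print(f"case-weighted mean distance: {weighted_miles:.1f} miles")
```

Centroid-to-centroid distance is a straight-line approximation at ZIP resolution. It says nothing about road networks or drive time, and a patient in the hospital's own ZIP is recorded as traveling zero miles.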

The second family of joins is to the CMS hospital quality and utilization datasets: the outcomes, readmissions, and star ratings; the healthcare-associated infection measures; and the cost reports. Joining patient flow to quality lets an analyst ask whether the patients flowing into a hospital are flowing toward better or worse care, and whether populations in particular ZIPs are systematically routed to lower-rated facilities. Joining patient flow to spending and post-acute utilization connects the front door of the inpatient stay to what happens after discharge. In every case the pattern is the same: the Hospital Service Area file supplies where the patients come from, a join on the CCN supplies what the hospital is and does, and only together do they answer questions about access, equity, competition, and quality that neither dataset can answer alone.
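As a sketch of the flow-to-quality join, the example below computes, for each patient ZIP, the case-weighted average rating of the hospitals its residents were admitted to. The `quality` frame and its `stars` column are hypothetical stand-ins for whichever CCN-keyed quality measure is being joined, and the values are invented.

```python
import pandas as pd

# Invented flows and a hypothetical CCN-keyed quality table.
flows = pd.DataFrame({
    "ccn":   ["H1", "H2", "H1", "H2"],
    "zip":   ["Z1", "Z1", "Z2", "Z2"],
    "cases": [90, 10, 20, 80],
})
quality = pd.DataFrame({"ccn": ["H1", "H2"], "stars": [4, 2]})

# validate guards against duplicate CCNs silently inflating case counts.
m = flows.merge(quality, on="ccn", how="left", validate="many_to_one")

# Case-weighted mean rating of the hospitals each ZIP's residents use.
m["weighted"] = m["stars"] * m["cases"]
by_zip = m.groupby("zip")[["weighted", "cases"]].sum()
zip_rating = by_zip["weighted"] / by_zip["cases"]
print(zip_rating)
```

Here the residents of Z1 are admitted mostly to the higher-rated hospital and the residents of Z2 mostly to the lower-rated one, which is the kind of contrast an equity analysis would then try to explain.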

Analytical uses

A national, hospital-resolved, ZIP-resolved map of where Medicare patients are admitted supports a distinctive family of analyses that hospital-level summaries alone cannot.

Market definition and concentration is the marquee use, described above: building empirical hospital markets from patient flows and computing the HHI to assess competitiveness, whether for a merger review, a state-level study of consolidation, or longitudinal tracking of how concentrated US hospital markets have become. Closely related is competitive intelligence: a hospital's own strategy office reads its catchment to see which ZIPs it dominates, which it is losing to rivals, and where an under-served area might justify a new outpatient site or service line.

Access and travel-distance studies use the file, joined to hospital locations, to measure how far patients travel for inpatient care and to identify populations—often rural—for whom the nearest hospital is far away or for whom a hospital closure would lengthen the trip to dangerous distances. The suppression caveat bites hardest here, because the long-distance flows are the suppressed ones, but for the volumes that are reported the file is the best national evidence of who travels how far. Equity analysis brings demographics to bear: by joining patient ZIPs to census characteristics, an analyst can ask whether residents of poorer or more-segregated ZIPs are routed to different—and lower-quality—hospitals than their neighbors, the patient-flow signature of structural inequity in access. And health-services research at large uses the file to build the Dartmouth-style service areas and referral regions that countless studies of geographic variation, utilization, and spending rest on.

Python workflow: catchment and concentration from the data.cms.gov API

The script below pulls the Hospital Service Area file from the data.cms.gov data API, then computes two of the core metrics: each hospital's catchment—its top origin ZIPs and the share of its cases from each—and a simple market-concentration measure, the Herfindahl-Hirschman Index of each origin ZIP's patients across the hospitals that serve them. No API key is required for the public data. Because the file is published as a versioned dataset whose UUID changes with each annual release, the script isolates the dataset UUID in one place to be resolved from the current dataset landing page, and because the column names vary slightly between releases it discovers the working CCN, ZIP, and case column names at runtime rather than hard-coding them.

import requests
import pandas as pd

# CMS data.cms.gov data API -- public, no key required.
# The Hospital Service Area file is published as a versioned dataset on
# the CMS data portal. Each annual release has its own dataset UUID; the
# data API serves rows as JSON from a stable per-dataset path:
#
#   https://data.cms.gov/data-api/v1/dataset/{UUID}/data
#
# UUIDs change with each yearly release, so resolve the current one from
# the dataset landing page rather than hard-coding it here.
DATASET_UUID = "REPLACE_WITH_CURRENT_HOSPITAL_SERVICE_AREA_UUID"
BASE = f"https://data.cms.gov/data-api/v1/dataset/{DATASET_UUID}/data"


def fetch_all(size=5000):
    # The data API paginates with offset + size. Column names vary
    # slightly by release; recent files use MEDICARE_PROV_NUM (the CCN),
    # ZIP_CD_OF_RESIDENCE, TOTAL_CASES, TOTAL_DAYS, and TOTAL_CHARGES.
    rows, offset = [], 0
    while True:
        r = requests.get(BASE, params={"size": size, "offset": offset},
                         timeout=120)
        r.raise_for_status()
        batch = r.json()
        if not batch:
            break
        rows.extend(batch)
        offset += size
    return pd.DataFrame(rows)


def _col(df, *needles):
    # Return the first column whose name contains all of the needles.
    for c in df.columns:
        if all(n.upper() in c.upper() for n in needles):
            return c
    return None


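def analyze(df):
    ccn   = _col(df, "PROV") or _col(df, "CCN")
    zipc  = _col(df, "ZIP")
    cases = _col(df, "TOTAL", "CASES") or _col(df, "CASES")
    if None in (ccn, zipc, cases):
        raise ValueError(
            f"Expected CCN, ZIP, and case columns; got {list(df.columns)}")
    # Work on a copy so the caller's frame is not modified. Suppressed or
    # non-numeric case counts become 0, so every total below covers
    # reported volume only.
    df = df.copy()
    df[cases] = pd.to_numeric(df[cases], errors="coerce").fillna(0)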

    # --- 1. Catchment: each hospital's top origin ZIPs ----------------
    for cur_ccn, g in df.groupby(ccn):
        g = g.sort_values(cases, ascending=False)
        total = g[cases].sum()
        top = g.head(5)
        print(f"\nHospital {cur_ccn}: {int(total):,} cases from "
              f"{g[zipc].nunique():,} ZIPs")
        for _, row in top.iterrows():
            share = row[cases] / total if total else 0
            print(f"   ZIP {row[zipc]}: {int(row[cases]):>6,} "
                  f"({share:.1%})")
        break  # demo: first hospital only

    # --- 2. Market concentration by patient ZIP (a simple HHI) --------
    # For each origin ZIP, how concentrated are its patients across the
    # hospitals that serve it? HHI = sum of squared market shares,
    # scaled 0..10000. >2500 is conventionally "highly concentrated".
    hhi = {}
    for zcode, g in df.groupby(zipc):
        total = g[cases].sum()
        if total <= 0:
            continue
        shares = (g[cases] / total) * 100.0
        hhi[zcode] = float((shares ** 2).sum())
    hhi_s = pd.Series(hhi).sort_values(ascending=False)
    concentrated = (hhi_s > 2500).sum()
    print(f"\nOrigin ZIPs analyzed: {len(hhi_s):,}")
    print(f"Highly concentrated ZIPs (HHI > 2500): "
          f"{concentrated:,} ({concentrated / max(len(hhi_s),1):.1%})")
    return df


if __name__ == "__main__":
    df = fetch_all()
    print(f"Hospital-by-ZIP rows loaded: {len(df):,}")
    analyze(df)

Two practical notes apply. First, the HHI computed here is deliberately a simple illustration: it measures, for each origin ZIP, how concentrated that ZIP's reported Medicare cases are across the hospitals serving it—a ZIP-level rather than a fully market-level Herfindahl. A rigorous merger-style analysis must first define the relevant market (a cluster of ZIPs and the hospitals competing for them), not assume that a single ZIP is a market, and must reckon with the cross-flows the Elzinga-Hogarty logic was built to handle; the per-ZIP HHI is a useful screening signal, not the regulatory measure itself. Second, the suppression of small cells is silent in the data—a suppressed hospital-ZIP pair's case and charge values simply are not published—so every sum, share, and index the script computes is over the reported volume only. For national-scale work, CMS also publishes the file as a flat downloadable extract, which is far more efficient than paging the full ~1.16 million rows through the API and ships with the authoritative column definitions for the release.
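For the flat extract, a minimal loading sketch looks like the following. The in-memory sample stands in for the downloaded file, its values are invented, the column names follow those noted for recent releases, and the `*` suppression marker is an assumption to verify against the release's data dictionary.

```python
import io
import pandas as pd

# Stand-in for the downloaded extract; a real run would pass the path
# to the CSV instead of this in-memory sample.
sample = io.StringIO(
    "MEDICARE_PROV_NUM,ZIP_CD_OF_RESIDENCE,TOTAL_DAYS,"
    "TOTAL_CHARGES,TOTAL_CASES\n"
    "010001,02139,120,450000,25\n"
    "010001,36301,31,*,*\n"
)

# Read everything as text first so CCNs and ZIPs keep their leading zeros.
df = pd.read_csv(sample, dtype=str)

# Convert only the measures. Suppressed cells become NaN.
for col in ["TOTAL_DAYS", "TOTAL_CHARGES", "TOTAL_CASES"]:
    df[col] = pd.to_numeric(df[col], errors="coerce")

# Flag suppression explicitly instead of letting it vanish into a zero.
df["suppressed"] = df["TOTAL_CASES"].isna()
print(df)
print(f"{df['suppressed'].mean():.0%} of rows have suppressed case counts")
```

Carrying an explicit `suppressed` flag makes it possible to report how much of each hospital's or ZIP's row count is censored alongside any total computed from the reported cells.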

Limitations and analytical caveats

The Hospital Service Area file is the best public map of where US hospital patients come from, but it carries structural limitations that an analyst must internalize before drawing conclusions from it.

It is Medicare fee-for-service only. The file sees the patient flow of traditional fee-for-service Medicare and largely misses Medicare Advantage enrollees, who form a large and growing share of the Medicare population. It also misses everyone outside Medicare: the commercially insured, Medicaid-only beneficiaries, and the uninsured, who are mostly under sixty-five and do not appear at all. A hospital's Medicare catchment is a real and important thing, but it is not the same as its total catchment, and the gap between the two varies by hospital and by market depending on payer mix and Medicare Advantage penetration. An analysis that treats the file's flows as the hospital's whole patient population, or that compares markets without accounting for differing Medicare Advantage shares, will draw conclusions the data cannot support.

Small-cell suppression biases the periphery. As laid out above, the cells censored for privacy are the low-volume ones, which are disproportionately the distant, rural, long-tail hospital-ZIP relationships. Reported volumes understate true volumes, reported travel distances understate true distances, and reported catchments are drawn too tight—all in the same direction. The bias is predictable, which means it can be reasoned about, but it cannot be ignored, and it is most severe in exactly the rural settings where access questions are most acute.

Patient flow is not the same as a competitive market. The file records where patients went, which is the starting point for defining a market but not the market itself. Patients travel for many reasons, including referral patterns, physician affiliations, insurance networks, and the location of a specialty service, and these are not the same as the competitive choice an antitrust analysis cares about. Observed flows can both understate competition (patients who could have chosen a rival but did not) and overstate it (flows driven by referral or network constraints rather than competition). Market definition is an inferential exercise that uses the flow data as evidence, not a fact read directly off the file, and the methodological choices in that inference, such as Elzinga-Hogarty tests, demand estimation, and the treatment of cross-flows, materially change the answer.

It is a lagged annual snapshot that counts cases, reports charges rather than payments, and uses ZIPs rather than addresses. The file is released for a given calendar year and lags real time; patient flows shift as hospitals open, close, and change service lines, so a current strategy or access decision should not lean on an old release. Its volume measure is cases (discharges), not distinct people, so it cannot distinguish many admissions of one patient from one admission each of many patients. The dollar figure is total charges, meaning what the hospital billed rather than what Medicare paid, and hospital charges bear only a loose relationship to actual payments, so the charge column should never be read as revenue or cost. And the geography is the beneficiary's ZIP of residence, which is coarse, irregular in size, and a point-in-time snapshot, so every spatial inference inherits the limits of ZIP-level data.

Held with these caveats in mind, the cms_hospital_service_area table is a uniquely valuable resource: a national, hospital-resolved, ZIP-resolved record of where the country's Medicare patients are admitted—the federal map of hospital catchments and markets that turns the abstract question of who competes with whom, and who can reach what care, into roughly 1.16 million concrete rows tying every hospital to the neighborhoods its patients come from.
