Behind every glass of tap water in America there is a system—a city utility serving millions, a rural water association run from a pickup truck, a school that pumps its own well, a highway rest stop with a single chlorinator. EPA keeps a census of all of them: roughly 150,000 active public water systems, drawn from an inventory file of around 400,000 records once deactivated systems are counted, one row per system, each keyed to a federal identifier and tagged with what kind of system it is, how many people it serves, where its water comes from, and which state regulates it. It is the spine of the entire drinking-water program—the inventory that the violations and the inspection records all hang off of.
This article covers what the public water system inventory is and how the Safe Drinking Water Act defines the systems it contains; the statutory threshold—twenty-five people or fifteen service connections—that decides what counts as a public water system at all; the three system types (community, non-transient non-community, and transient non-community) and what the classification governs; the five size categories by population served and the dramatic skew between the count of systems and the population they serve; the source-water distinction between surface water and groundwater and why it matters; the role of state primacy agencies in maintaining the inventory; how the epa_sdwa_water_systems table anchors the violations and site-visit datasets through the PWSID; a Python workflow that pulls the inventory from the SDWIS API and weighs system counts against population served; and the caveats—state reporting variation, inactive-system clutter, and population-estimate imprecision—that every analyst must internalize first.
What the dataset is
The Safe Drinking Water Information System, universally abbreviated SDWIS, is EPA's national database for the public drinking-water program. It holds several distinct records: the contaminant monitoring results that systems report, the violations they incur, the enforcement actions taken against them, the on-site sanitary surveys conducted at them, and—the foundation on which all of those sit—the inventory of public water systems themselves. The inventory is the registry of every system EPA and the states regulate: who they are, what type, how big, what their water source is, and who oversees them. Surfaced through EPA's ECHO platform and the SDWIS public data downloads, the inventory comprises roughly 400,000 system records once both active and inactive systems are counted—of which on the order of 150,000 are active public water systems operating today.
In our database this record is stored as the table epa_sdwa_water_systems, at a grain of one row per public water system: one PWSID identifies one system, and that system has exactly one row. Where the violations and site-visit tables have many rows per system (one per violation, one per inspection), the inventory has exactly one, which makes it the natural dimension table against which the event tables are measured. The columns capture who the system is, what it is, how big it is, where its water comes from, and who regulates it:
pwsid -- public water system ID (state code + 7 digits)
pws_name -- the system's name (utility, district, facility)
primacy_agency_code -- the state / primacy agency that regulates it
pws_type_code -- CWS, NTNCWS, or TNCWS (the three system types)
pws_activity_code -- A = active; inactive systems remain on file
population_served_count -- estimated retail population the system serves
service_connections_count -- number of service connections
primary_source_code -- GW, SW, GU, SWP, GWP (ground vs surface water)
owner_type_code -- federal, state, local, private, native
gw_sw_code -- groundwater vs surface-water rollup
state_code -- the state in which the system operates
is_school_or_daycare -- flag for school / daycare systems
The pwsid is the load-bearing column. The public water system identification number is a persistent identifier (a two-character state or primacy-agency code followed by seven digits) assigned to every regulated public water system. It is the key that ties the inventory record to the same system's monitoring results, its sanitary surveys, and, most importantly for analysis, its violations. Because the PWSID appears on every SDWIS table, the inventory row in epa_sdwa_water_systems can be joined directly to every event the same system ever generated. The other columns are what give those events meaning. The pws_type_code records the three-way system classification; the population_served_count supplies the denominator for almost every per-capita calculation; the primary_source_code and gw_sw_code distinguish surface water from groundwater; and the primacy_agency_code names the state program that owns the record. Without the inventory, a violation or an inspection is an anonymous event; with it, every event is anchored to a system of known size, type, source, and regulator.
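The PWSID format described above (a two-character state or primacy-agency code followed by seven digits) is worth checking before any join, since a malformed key silently drops rows. The following is a minimal sketch, not part of any EPA tooling: the helper name split_pwsid is hypothetical, the sample IDs are invented, and the assumption that the two-character prefix may contain digits as well as letters should be verified against the release in use.

```python
import re

# Assumed pattern: two alphanumeric characters (state or primacy-agency code)
# followed by exactly seven digits, as the article describes the PWSID.
PWSID_RE = re.compile(r"([A-Z0-9]{2})(\d{7})")


def split_pwsid(raw):
    """Return (prefix, serial) for a well-formed PWSID, or None otherwise."""
    if raw is None:
        return None
    # Normalize whitespace and case before matching the whole string.
    m = PWSID_RE.fullmatch(str(raw).strip().upper())
    return (m.group(1), m.group(2)) if m else None


# Invented sample values, for illustration only.
print(split_pwsid(" tx1234567 "))  # well-formed after normalization
print(split_pwsid("TX12345"))      # too short, so rejected
```

Normalizing case and whitespace first is a deliberate choice: keys that differ only in formatting would otherwise fail to match across tables.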
What it is and the SDWA regulatory frame
The Safe Drinking Water Act (SDWA) was enacted in 1974 as the federal government's answer to a basic public-health gap: the country had no national, enforceable standards for the safety of tap water delivered to the public. Before the SDWA, drinking-water standards were a patchwork of non-binding federal guidance and uneven state rules. The 1974 Act gave EPA authority to set enforceable national standards for contaminants in drinking water and to require the systems that deliver water to the public to monitor for those contaminants, treat the water to meet the standards, and notify the public when they fail. But to do any of that, the program first needed to know which systems existed and were therefore covered—and so the public water system inventory is, in a real sense, the program's starting point: the list of everyone the rules apply to.
The standards EPA sets under this authority are the National Primary Drinking Water Regulations (NPDWRs), each of which sets, for a regulated contaminant, either an enforceable maximum contaminant level or a required treatment technique, along with monitoring and reporting schedules. Which of these regulations a given system must meet depends on what the inventory says the system is. A community system serving a town year-round is held to the full battery of NPDWRs; a highway rest stop is held only to the rules addressing acute, short-term risk. The inventory's classification fields are therefore not mere description—they are the switch that determines a system's entire regulatory obligation, which is why the three system types in the next section carry so much weight.
The most important structural fact about the SDWA, and the one that explains the shape of the inventory data, is primacy. EPA writes the national regulations, but it does not, in the ordinary course, run the program on the ground. Instead, states (and a number of tribes and territories) apply for and receive primary enforcement responsibility, or primacy, to administer the drinking-water program within their borders. To obtain primacy a state must adopt regulations at least as stringent as the federal NPDWRs and demonstrate the capacity to enforce them; nearly all states hold it. The practical consequence is that the state primacy agency maintains the inventory: it is a state drinking-water program, not EPA, that issues the PWSID, records the system type, estimates the population served, and classifies the source water, then reports those details up to EPA, where they populate SDWIS. This is why the dataset is keyed by a primacy_agency_code and why, as the caveats section will stress, the completeness and consistency of the inventory vary from one primacy agency to the next: the federal census is an aggregation of fifty-odd state registries reporting in.
What counts as a public water system
The SDWA does not regulate all water—it regulates public water systems, and the statutory definition is precise and consequential. A public water system is a system that provides water for human consumption through pipes or other constructed conveyances to at least fifteen service connections or that regularly serves an average of at least twenty-five people for at least sixty days a year. Everything in the inventory clears that threshold; everything below it is invisible to the program. A private household well serving a single family is not a public water system and does not appear in SDWIS—a point that matters because tens of millions of Americans drink from such private wells, entirely outside the federal drinking-water safety net the inventory describes.
The twenty-five-people-or-fifteen-connections threshold is what makes the inventory so large and so long-tailed. It is low enough to sweep in an enormous number of very small operations that most people would never think of as “water utilities”: a church camp, a mobile-home park, a roadside diner, an isolated factory, a small school, a vineyard tasting room with a well. Each of these, the moment it regularly serves twenty-five people, becomes a regulated public water system with a PWSID, monitoring obligations, and a place in the inventory. The result is a dataset whose sheer count—roughly 150,000 active systems, sitting inside an inventory file of around 400,000 records—is dominated not by the cities everyone pictures but by the vast undergrowth of tiny systems, and understanding that skew is the single most important interpretive move an analyst can make with this data.
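The statutory test in this section reduces to a two-branch rule, which can be sketched in a few lines. This is a simplification for intuition only, not a legal determination: the function name and parameter names are hypothetical, and the sample operations are invented.

```python
def is_public_water_system(service_connections, avg_people_served, days_served_per_year):
    """Apply the threshold as paraphrased in the text above.

    A system qualifies if it has at least 15 service connections, OR if it
    regularly serves an average of at least 25 people for at least 60 days a year.
    """
    by_connections = service_connections >= 15
    by_population = avg_people_served >= 25 and days_served_per_year >= 60
    return by_connections or by_population


# Invented examples, for illustration only.
print(is_public_water_system(1, 4, 365))    # single-family household well
print(is_public_water_system(3, 40, 90))    # seasonal campground
print(is_public_water_system(20, 18, 365))  # small subdivision, qualifies by connections
```

The two branches are independent, which is why a seasonal campground with three taps and a subdivision with fewer than twenty-five residents can both end up in the inventory.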
The three system types
Within the universe of public water systems, the regulations draw a consequential three-way distinction recorded in the pws_type_code, based on the population a system serves and how consistently it serves them. The distinction is not bookkeeping: it determines which NPDWRs each system must meet, because the health risk—and therefore the regulatory intensity—differs sharply by type.
Community water systems (CWS) serve the same population year-round: the municipal utilities, water districts, mobile-home parks, and subdivisions that supply people's homes. They are the systems of greatest concern because their customers drink the water every day for years, so chronic, low-level exposure to contaminants like lead, arsenic, nitrate, disinfection byproducts, and now PFAS accumulates. Community water systems are subject to the full battery of NPDWRs and the most thorough oversight. Although they are far from the most numerous type, they serve the large majority of Americans, because they include every city and town system.
Non-transient non-community water systems (NTNCWS) serve at least twenty-five of the same people for at least six months of the year, but not as their residence. The archetypes are schools, factories, office buildings, and hospitals that operate their own well. Because the same people (children at a school, workers at a plant) drink the water repeatedly over long periods, NTNCWSs are held to most of the chronic-exposure standards that apply to community systems.
Transient non-community water systems (TNCWS) serve transient populations, meaning different people who do not stay long: highway rest stops, campgrounds, gas stations, restaurants, and parks with their own water source. Because no individual is exposed for long, transient systems are regulated only for the contaminants that pose an acute, short-term risk, principally microbial pathogens and nitrate. They are typically the most numerous type by count while serving the smallest share of the population, which is exactly the inversion the next section explores.
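For analysis it helps to decode pws_type_code once and carry a chronic-exposure flag alongside it. The lookup below is a sketch: the scope labels paraphrase the descriptions in this section and are not regulatory citations, and the names PWS_TYPES and describe_type are hypothetical.

```python
# Labels paraphrase the article's descriptions of the three system types.
PWS_TYPES = {
    "CWS": ("community", "full set of NPDWRs"),
    "NTNCWS": ("non-transient non-community", "most chronic-exposure standards plus acute ones"),
    "TNCWS": ("transient non-community", "acute-risk contaminants only"),
}

# Types whose users drink the same water repeatedly over long periods.
CHRONIC_EXPOSURE_TYPES = {"CWS", "NTNCWS"}


def describe_type(code):
    """Decode a pws_type_code into a label, a scope note, and an exposure flag."""
    key = str(code).strip().upper()
    label, scope = PWS_TYPES.get(key, ("unknown", "unknown"))
    return {
        "code": key,
        "label": label,
        "scope": scope,
        "chronic_exposure": key in CHRONIC_EXPOSURE_TYPES,
    }


for code in ("CWS", "ntncws", "TNCWS", "??"):
    print(describe_type(code))
```

Unrecognized codes fall through to "unknown" rather than raising, so a stray value in a state's submission does not halt a national run.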
Size categories and the long-tail skew
Cutting across the three types is a size category based on the population served, recorded in the inventory's population_served_count. EPA and the states commonly bin systems into very small (serving 500 or fewer people), small (501–3,300), medium (3,301–10,000), large (10,001–100,000), and very large (more than 100,000). This is the field that does the most analytic work, because the distribution of systems across these classes is extraordinarily lopsided.
The central fact of the inventory—the thing the Python workflow at the end of this article is built to expose—is the divergence between the count of systems and the population those systems serve. The overwhelming majority of public water systems are very small or small: the long tail of rural water associations, campground wells, and single-school systems makes up most of the roughly 150,000 active records. Yet that enormous number of tiny systems collectively serves only a modest fraction of the American public. At the other end, a comparatively tiny number of large and very large community systems—the big city utilities—serve the bulk of the population. A handful of systems serve most people; most systems serve very few. This is not a quirk; it is the defining structure of the American drinking-water landscape, and it has a profound consequence for compliance: the small systems, which are most numerous, are also the ones that struggle most—a part-time operator, no dedicated compliance staff, no economies of scale to fund treatment upgrades. The tail of tiny systems is where the violations concentrate even though the population at stake per system is small.
This skew is why any analysis must hold two numbers in view at once. Ranking states or systems on raw counts measures the shape of the inventory—mostly how many small systems a state happens to contain—while weighting by population measures impact on people. A state with thousands of tiny groundwater systems and a state with a few large surface-water utilities can serve identical populations and look completely different in a count-based view. The honest framing always pairs the count with the population: how many systems fall in each class, and what share of the public each class actually serves. That pairing is the headline metric the script computes, and it is the lens through which every other question about the inventory should be read.
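The pairing of count share and population share can be seen on a toy inventory. The sketch below uses ten invented systems (not SDWIS data) and pandas' cut function with the five size classes named above; the placeholder prefix "XX" and the column name n_systems are assumptions made for the example.

```python
import pandas as pd

# Synthetic populations, invented for illustration only.
toy = pd.DataFrame({
    "pwsid": [f"XX{i:07d}" for i in range(1, 11)],
    "population_served_count": [40, 75, 120, 300, 450, 900, 2500, 8000, 60000, 900000],
})

# Right-closed bins matching the five size classes; include_lowest keeps a
# population of exactly 0 inside the first class.
bins = [0, 500, 3300, 10000, 100000, float("inf")]
labels = ["very small", "small", "medium", "large", "very large"]
toy["size_class"] = pd.cut(
    toy["population_served_count"], bins=bins, labels=labels, include_lowest=True
)

summary = toy.groupby("size_class", observed=False)["population_served_count"].agg(
    n_systems="count", population="sum"
)
summary["pct_systems"] = summary["n_systems"] / summary["n_systems"].sum()
summary["pct_population"] = summary["population"] / summary["population"].sum()
print(summary)
```

In this toy set half of the systems are very small, yet the single very large system accounts for more than nine tenths of the people served, which is the same inversion the real inventory shows at scale.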
Source water: groundwater versus surface water
The inventory's source-water fields—primary_source_code and the simpler gw_sw_code rollup—record where a system's water comes from, and the distinction is more than descriptive. The fundamental split is between groundwater (GW), drawn from wells and aquifers, and surface water (SW), drawn from rivers, lakes, and reservoirs, with intermediate codes for groundwater under the direct influence of surface water (GU) and for systems that purchase already-treated water from another system (the purchased variants). Most public water systems by count are groundwater systems—a well is how a small rural operation supplies itself—while a large share of the population is served by surface-water systems, because the big city utilities tend to draw from rivers and reservoirs.
The source type drives the regulatory regime. Surface water is more exposed to microbial contamination, runoff, and upstream discharges, so surface-water systems are subject to the Surface Water Treatment Rules: filtration and disinfection requirements that groundwater systems generally are not. Groundwater systems, by contrast, contend more with naturally occurring contaminants leached from rock (arsenic, radionuclides, nitrate from agricultural infiltration) and with the vulnerability of their wellheads. Knowing a system's source is therefore the first thing an analyst needs to interpret its violation profile: a treatment-technique violation means something very different for a surface-water system than for a groundwater one, and source-water contamination questions (a well downgradient of a hazardous-waste site, an intake downstream of an industrial discharger) begin with this field. The source code is also where the cross-program joins to other EPA datasets gain their footing, because the contamination threats to a system's source live in the air, water, and waste databases of other programs.
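One quick consistency check is to derive the groundwater/surface-water rollup from primary_source_code and compare it with the stored gw_sw_code. The mapping below is an assumption, not a documented EPA rule: it treats purchased variants as following their underlying source and groups GU (groundwater under the direct influence of surface water) with surface water. The helper names and sample records are hypothetical, and the mapping should be confirmed against the values in the release being analyzed.

```python
# ASSUMED rollup of the five source codes listed in the column description.
SOURCE_ROLLUP = {"GW": "GW", "GWP": "GW", "SW": "SW", "SWP": "SW", "GU": "SW"}


def rollup_source(primary_source_code):
    """Map a primary source code to GW or SW, or 'unknown' if unrecognized."""
    return SOURCE_ROLLUP.get(str(primary_source_code).strip().upper(), "unknown")


def rollup_mismatches(records):
    """Return PWSIDs whose stored gw_sw_code disagrees with the assumed rollup."""
    return [
        r["pwsid"]
        for r in records
        if rollup_source(r["primary_source_code"]) != str(r["gw_sw_code"]).strip().upper()
    ]


# Invented records, for illustration only.
sample = [
    {"pwsid": "XX0000001", "primary_source_code": "GW", "gw_sw_code": "GW"},
    {"pwsid": "XX0000002", "primary_source_code": "SWP", "gw_sw_code": "SW"},
    {"pwsid": "XX0000003", "primary_source_code": "GU", "gw_sw_code": "GW"},
]
print(rollup_mismatches(sample))
```

A mismatch list is a prompt to read the release's documentation rather than proof of an error, since a disagreement may simply mean the assumed mapping is wrong for that code.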
How the inventory anchors the violations and site-visit data
The inventory's defining role is as the dimension table of the entire SDWIS record. On its own it is a census: informative, but static. Its real power is that every other drinking-water dataset is keyed to the same PWSID and is therefore meaningless without it. Two joins matter most.
The first is to the violations record. The SDWA violations dataset holds one row per violation, keyed by PWSID, with the violation type (health-based maximum-contaminant-level and treatment-technique violations versus monitoring and reporting violations), the contaminant or rule violated, the compliance period, and the return-to-compliance status. A violations table on its own is a stream of bare codes; joined to the inventory it becomes interpretable. The inventory supplies the population served needed to ask how many people a violation actually affected, the system type needed to separate the chronic-exposure systems from the transient ones, and the source code needed to read a treatment violation correctly. The single most cited finding about SDWA compliance—that violations concentrate overwhelmingly in the small and very small systems—is not visible in the violations data alone. It only appears when the violations are joined to the inventory's size class, which is exactly why the inventory is the table the violations data hangs off of.
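The violations join described above can be sketched with pandas on invented data. The violations column is_health_based is a hypothetical stand-in for whatever the real table uses to separate health-based from monitoring violations, and all rows are synthetic. The merge validates that the inventory side has one row per PWSID and flags violations whose system is missing from the inventory.

```python
import pandas as pd

# Synthetic inventory (one row per system) and violations (many rows per system).
inventory = pd.DataFrame({
    "pwsid": ["XX0000001", "XX0000002", "XX0000003", "XX0000004"],
    "pws_type_code": ["TNCWS", "CWS", "CWS", "NTNCWS"],
    "population_served_count": [60, 420, 250000, 1200],
})
violations = pd.DataFrame({
    "pwsid": ["XX0000001", "XX0000001", "XX0000002", "XX0000004", "XX0000009"],
    "is_health_based": [False, True, True, False, True],
})


def size_class(pop):
    """Bin a population into the five standard size classes."""
    if pop <= 500:
        return "very small"
    if pop <= 3300:
        return "small"
    if pop <= 10000:
        return "medium"
    if pop <= 100000:
        return "large"
    return "very large"


inventory["size_class"] = inventory["population_served_count"].map(size_class)

# many_to_one raises if the inventory ever holds duplicate PWSIDs;
# indicator=True adds a _merge column marking unmatched violations.
joined = violations.merge(
    inventory, on="pwsid", how="left", validate="many_to_one", indicator=True
)
orphans = joined[joined["_merge"] == "left_only"]
matched = joined[joined["_merge"] == "both"]
per_class = matched.groupby("size_class").size()

print(per_class)
print(f"violations with no inventory row: {len(orphans)}")
```

Keeping the orphan count visible matters in practice: a violation that fails to match the inventory would otherwise vanish from every size-class tally without any warning.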
The second join is to the site-visit record, the sanitary surveys and follow-up inspections that primacy agencies conduct at systems, also keyed by PWSID. The inventory is what lets an analyst normalize inspection frequency by system type and size—to ask not just how many visits a state recorded, but whether the community systems are being surveyed on their required three-year cadence and whether the tiny transient systems are being reached at all. And because the inventory, the site visits, and the violations all share the PWSID, the three can be assembled into a single system-resolved picture: a system of known size and source, the inspections that found its deficiencies, and the violations those deficiencies foreshadowed. Beyond SDWIS, the same PWSID and the system's location let the inventory connect to EPA's cross-program facility data—relating a public water system to the RCRA hazardous-waste handlers, the toxic-release reporters, and the Clean Water Act dischargers whose releases threaten its source. The inventory is the hub; everything else is a spoke.
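The cadence question raised above (are community systems being surveyed within three years?) can be sketched by taking each system's most recent visit and comparing it with a reference date. All data here is invented, the column name visit_date is hypothetical, and the three-year window is applied only to community systems, following the text.

```python
import pandas as pd

# Synthetic inventory and site visits, for illustration only.
inventory = pd.DataFrame({
    "pwsid": ["XX0000001", "XX0000002", "XX0000003", "XX0000004"],
    "pws_type_code": ["CWS", "CWS", "CWS", "TNCWS"],
})
visits = pd.DataFrame({
    "pwsid": ["XX0000001", "XX0000001", "XX0000002"],
    "visit_date": pd.to_datetime(["2019-06-01", "2023-05-01", "2020-03-15"]),
})

as_of = pd.Timestamp("2024-12-31")
cutoff = as_of - pd.DateOffset(years=3)

# Most recent visit per system; systems never visited get NaT after the left join.
last_visit = (
    visits.groupby("pwsid")["visit_date"].max().rename("last_visit").reset_index()
)
status = inventory.merge(last_visit, on="pwsid", how="left")
status["overdue"] = status["last_visit"].isna() | (status["last_visit"] < cutoff)

cws_overdue = status[(status["pws_type_code"] == "CWS") & status["overdue"]]
print(cws_overdue[["pwsid", "last_visit"]])
```

Starting from the inventory rather than the visits table is the point of the exercise: a system with no visits at all appears only because the inventory lists it.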
Analytical uses
A national, system-resolved census of drinking-water systems supports a distinctive set of analyses that the event datasets cannot produce alone.
Mapping the drinking-water landscape is the most basic use: counting systems by type, size, and source, and—crucially—contrasting those counts with the population each class serves, to show how a small handful of large systems serves most Americans while an enormous tail of tiny systems serves the rest. This is the foundational picture every other drinking-water analysis assumes. Compliance burden by system size joins the inventory to the violations data to demonstrate the concentration of violations in the small-system tail, normalized by the number of systems and the population at risk, which is the empirical basis for the policy focus on small-system capacity.
Population exposure estimates use the population_served_count as the denominator to translate violation and deficiency findings into people affected, which is the number that matters for public health and for prioritizing intervention. Source-vulnerability screening uses the source-water fields together with cross-program facility data to identify systems drawing from sources near industrial, waste, or military contamination, the analysis the 2024 national PFAS standard now makes urgent. And infrastructure and equity targeting combines a system's size, source, and oversight history with the demographics of the community it serves to surface the small, under-resourced systems serving the people least able to absorb a contamination event or a rate increase to fix it, which is exactly the population the Drinking Water State Revolving Fund and small-system assistance programs exist to reach.
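A common pitfall in exposure estimates is counting a system's population once per violation instead of once per system. The sketch below contrasts the two on invented data; the tuple layout (pwsid, health_based) is a hypothetical simplification of a violations record.

```python
# Synthetic inventory: PWSID -> population served. Invented values.
inventory = {"XX0000001": 400, "XX0000002": 5000, "XX0000003": 90000}

# Synthetic violations as (pwsid, health_based) pairs.
violations = [
    ("XX0000001", True),
    ("XX0000001", True),
    ("XX0000001", True),
    ("XX0000002", True),
    ("XX0000003", False),  # monitoring/reporting only, excluded below
]

# Deduplicate to distinct systems before summing population.
affected_systems = {pwsid for pwsid, health_based in violations if health_based}
people_affected = sum(inventory[p] for p in affected_systems if p in inventory)

# Naive version: sums population once per violation row, inflating the total.
naive_total = sum(inventory.get(p, 0) for p, health_based in violations if health_based)

print(f"people served by systems with a health-based violation: {people_affected:,}")
print(f"naive per-violation sum (overcounts): {naive_total:,}")
```

Because population_served_count is itself an estimate, the deduplicated figure is still best read as an order of magnitude rather than a precise headcount.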
Python workflow: the inventory from the SDWIS API
The script below pulls the public water system inventory for a state from EPA's Envirofacts/SDWIS REST service, restricts it to active systems, breaks the systems down by type and by size class, and—the headline comparison—weighs the count of systems in each size class against the population that class serves, exposing the long-tail skew directly. No API key is required for public data. Because SDWIS extract column names vary between releases, the script discovers the working type, population, and source column names at runtime rather than hard-coding them; any production use should be validated against the current SDWIS metadata catalog and should page through the full result set for large states.
import math

import pandas as pd
import requests

# EPA SDWIS / Envirofacts REST service -- no API key required for public data.
# The WATER_SYSTEM table is the inventory: one row per public water system,
# keyed by PWSID, with the system type, the population served, the primary
# source type, and the primacy agency. This script pulls a state's systems,
# breaks them down by type and size class, and -- the key comparison --
# weighs the COUNT of systems in each size class against the POPULATION
# that class serves, exposing the long-tail skew of the inventory.
SDWIS = "https://data.epa.gov/efservice"


def _rows(table, col, op, val, fmt="JSON", page=100000):
    # Envirofacts path grammar: /TABLE/COLUMN/OPERATOR/VALUE/FORMAT/rows/START:END
    # Fetches a single page only; large states need further START:END ranges.
    path = f"{SDWIS}/{table}/{col}/{op}/{val}/{fmt}/rows/0:{page}"
    r = requests.get(path, timeout=180)
    r.raise_for_status()
    return r.json()


def _find(cols, *needles):
    # Return the first column whose name contains all of the needles (case-insensitive).
    for c in cols:
        u = c.upper()
        if all(n.upper() in u for n in needles):
            return c
    return None


# Five standard size classes by retail population served.
def size_class(pop):
    try:
        p = float(pop)
    except (TypeError, ValueError):
        return "unknown"
    if math.isnan(p):
        # NaN fails every <= test below and would otherwise land in "very large".
        return "unknown"
    if p <= 500:
        return "very small (<=500)"
    if p <= 3300:
        return "small (501-3,300)"
    if p <= 10000:
        return "medium (3,301-10,000)"
    if p <= 100000:
        return "large (10,001-100,000)"
    return "very large (>100,000)"


def inventory(state):
    # Pull every system registered to the primacy agency, then keep only the
    # active ones (PWS_ACTIVITY_CODE = 'A'); inactive PWSIDs remain on file.
    rows = _rows("WATER_SYSTEM", "PRIMACY_AGENCY_CODE", "=", state)
    df = pd.DataFrame(rows)
    if df.empty:
        print(f"No water-system records returned for {state}.")
        return df
    act = _find(df.columns, "ACTIVITY")
    if act:
        # .copy() so later column assignments do not write to a view.
        df = df[df[act].astype(str).str.upper() == "A"].copy()
    return df


def analyze(state):
    df = inventory(state)
    if df.empty:
        return
    type_col = _find(df.columns, "PWS", "TYPE")
    pop_col = _find(df.columns, "POPULATION") or _find(df.columns, "POP", "SERVED")
    src_col = _find(df.columns, "SOURCE")
    if type_col is None or pop_col is None:
        # Column names vary between releases; stop rather than index with None.
        print("Type or population column not found; check the current SDWIS column names.")
        return
    df["size_class"] = df[pop_col].apply(size_class)
    print(f"{state}: {len(df):,} active public water systems")

    # --- Systems by type -------------------------------------------------
    print("\n  Systems by type (CWS / NTNCWS / TNCWS):")
    for t, n in df[type_col].value_counts().items():
        print(f"    {t:<10} {n:>7,}")

    # --- The headline comparison: count of systems vs population served --
    pop = pd.to_numeric(df[pop_col], errors="coerce").fillna(0)
    grp = df.assign(pop=pop).groupby("size_class")["pop"].agg(["count", "sum"])
    total_pop = grp["sum"].sum()
    order = ["very small (<=500)", "small (501-3,300)", "medium (3,301-10,000)",
             "large (10,001-100,000)", "very large (>100,000)"]
    print(f"\n  {'Size class':<22}{'systems':>8} {'%sys':>6} {'population':>13} {'%pop':>7}")
    for cls in order:
        if cls not in grp.index:
            continue
        c, s = int(grp.loc[cls, "count"]), int(grp.loc[cls, "sum"])
        print(f"  {cls:<22}{c:>8,} {c/len(df):>6.1%} {s:>13,} {s/max(total_pop, 1):>7.1%}")

    # --- Source water mix ------------------------------------------------
    if src_col:
        print("\n  Primary water source:")
        for s, n in df[src_col].value_counts().head(6).items():
            print(f"    {s:<6} {n:>7,}")


if __name__ == "__main__":
    analyze("TX")
Two practical notes apply. First, the active-system filter is essential, not optional. Inactive PWSIDs—systems that have merged, been deactivated, or gone out of service—remain in the inventory, and a raw row count that does not filter on pws_activity_code = 'A' will overstate the number of systems actually operating today: the full file runs to roughly 400,000 records, while only about 150,000 are active. The script applies the filter before counting, and any “how many systems are there” question should do the same. Second, for national-scale work—ranking every primacy agency, or building the full violations-joined exposure analysis—EPA's SDWIS public data download files (and the ECHO bulk data services) are far more efficient than thousands of paginated API calls and ship with the authoritative, version-stamped column definitions for the release. The API is ideal for one state or one slice; the bulk download is the right tool for the whole country.
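For a large state, a single request may not return every row, so the result set has to be walked in START:END ranges. The sketch below separates the paging loop from the HTTP call so the loop can be tested offline: fetch_page is a caller-supplied function (hypothetical here) that returns the rows for one range, and the assumption that END is inclusive should be checked against the service's current documentation.

```python
def fetch_all(fetch_page, page_size=10000, max_pages=1000):
    """Collect rows page by page until a short or empty page comes back.

    fetch_page(start, end) must return a list of rows for the inclusive
    range start..end (ASSUMED semantics; verify against the live service).
    max_pages is a safety stop against an endless loop.
    """
    rows = []
    start = 0
    for _ in range(max_pages):
        end = start + page_size - 1
        page = fetch_page(start, end)
        rows.extend(page)
        if len(page) < page_size:
            break  # a short page means the result set is exhausted
        start = end + 1
    return rows


# Offline stand-in for the HTTP call: 25 fake rows served in inclusive ranges.
fake_rows = [{"pwsid": f"XX{i:07d}"} for i in range(25)]


def fake_fetch(start, end):
    return fake_rows[start:end + 1]


all_rows = fetch_all(fake_fetch, page_size=10)
print(len(all_rows))
```

In real use, fetch_page would wrap a request built from the same path grammar the script uses, with the range substituted into the rows segment.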
Limitations and analytical caveats
The inventory is the most comprehensive public census of drinking-water systems in the United States, but it carries structural limitations that an analyst must internalize before drawing conclusions from it.
State reporting varies, because the states are the registrars. Under primacy, the inventory is built and maintained by fifty-odd independent state programs, each with its own data systems, its own conventions for classifying a system's type and source, and its own discipline about forwarding updates to EPA. Apparent differences between states in the number of systems, in the mix of types, or in how source water is coded may partly reflect these registration and coding differences rather than real differences in the underlying water landscape. Cross-state comparisons should be made with this firmly in mind.
Inactive systems clutter the file, and population counts are estimates. Deactivated and merged systems persist in the inventory with an inactive status code—they make up well over half of the roughly 400,000 records on file—so any count that does not filter them out will be badly inflated; the active filter is the difference between counting the roughly 150,000 operating systems and counting historical PWSIDs. And the population_served_count is an estimate, not a meter reading. For a small system it may be a rough headcount; for a large utility it reflects a service-area calculation. The estimates are updated on the states' schedules and can lag actual growth or decline. Treat population served as a serviceable order-of-magnitude figure for sizing and weighting, not as a precise population statistic.
The inventory excludes private wells, and a record is a classification, not a guarantee of service quality. Tens of millions of Americans drink from private household wells that fall below the public-water-system threshold and never appear in SDWIS at all; the inventory describes the regulated landscape, not the whole of American drinking water. And the inventory tells you what a system is—its type, size, and source—not how well it is run. A clean inventory row says nothing about whether the system is meeting its standards; that lives in the violations and site-visit data the inventory exists to anchor. Treating presence in the inventory, or a particular classification, as evidence of water quality reads more into the dataset than it can bear.
Held with these caveats in mind, the epa_sdwa_water_systems table is a uniquely valuable resource: a system-resolved census of every operation that pipes drinking water to the American public, from the largest city utility to the smallest campground well. It is the registry that turns a stream of anonymous violation and inspection codes into a map of who serves whom, where the water comes from, and which of the country's roughly 150,000 active systems are too small to manage the safety the Safe Drinking Water Act demands of them.
Related writing
EPA Drinking Water Violations: The Federal Database Behind Safe Drinking Water Act Enforcement — The violations record is the event table that hangs off this inventory: keyed by the same PWSID, it only becomes interpretable—how many people affected, which system type, what source—once joined to the inventory's size, type, and population fields.
EPA Safe Drinking Water Act Site Visits: The Federal Record of Public Water System Inspections — The sanitary-survey record is the preventive half of the program, and the inventory is what lets an analyst normalize inspection frequency by system type and size and test whether the smallest systems are being reached at all.
EPA RCRA Hazardous Waste Data: The Federal Database Behind 400,000 Regulated Facilities — Source-water vulnerability is a cross-program question, and through EPA's facility linkage a public water system in this inventory can be related to the RCRA hazardous-waste handlers whose releases threaten the groundwater it draws from.