
IRS Exempt Organizations: The Federal Database Behind 1.26 Million US Nonprofits

AI Analytics
Tags: IRS, Nonprofits, 501(c)(3), Tax-Exempt, Federal Data

The Internal Revenue Service maintains the authoritative registry of every tax-exempt organization in the United States: 1.26 million currently active entities plus millions more historically terminated. The Exempt Organizations Business Master File (EO BMF) is the canonical source: a monthly-updated flat file covering 501(c)(3) public charities, 501(c)(4) social welfare organizations, 501(c)(6) trade associations, 527 political organizations, and 28 other IRC subsection categories. Layered on top, the IRS publishes every electronically filed Form 990 as machine-readable XML on an AWS S3 bucket accessible to anyone without authentication. Together these two datasets constitute the most comprehensive public financial record of the US nonprofit sector, a $2.8 trillion annual revenue ecosystem that employs roughly 12 million Americans and accounts for approximately 5.5 percent of GDP.

Scale and composition of the tax-exempt sector

The nonprofit sector's aggregate scale is routinely underestimated because it is distributed across dozens of IRC subsection categories that are rarely analyzed together. The 501(c)(3) category is by far the largest: roughly one million organizations encompassing public charities (hospitals, universities, food banks, environmental advocacy organizations, civil rights groups) and private foundations (family foundations, corporate foundations, operating foundations). 501(c)(3) public charities reported approximately $2.1 trillion in total revenues in the most recent complete year of IRS Statistics of Income data.

The 501(c)(4) social welfare category contributes approximately 80,000 registered organizations, including neighborhood associations, homeowners associations, volunteer fire departments, and the class of politically active organizations that became prominent after the Supreme Court's 2010 Citizens United v. FEC decision. 501(c)(6) trade associations and professional organizations account for approximately 60,000 registrations, covering every industry sector from the American Medical Association to the National Association of Home Builders. 501(c)(7) social clubs (country clubs, fraternal organizations) number around 13,000. The 527 political organization category, which explicitly covers political parties, PACs, and campaign committees, contains several thousand active registrations.

Religious organizations represent the largest single gap in the public data. Churches, integrated auxiliaries, and most religious orders are exempt from the Form 990 filing requirement under IRC § 6033(a)(3). A church can also claim tax-exempt status without applying for an IRS determination (IRC § 508(c)(1)(A)), and many do so legitimately. Separately, the Church Audit Procedures Act (IRC § 7611) severely restricts the IRS's ability to examine church finances and requires approval from a high-level Treasury official before initiating any church tax inquiry. The practical result is that thousands of organizations receiving tax benefits have zero public financial disclosure. Estimates suggest the religious sector controls hundreds of billions in tax-exempt assets, but the BMF captures only those that chose to seek an IRS determination letter.

The EO BMF: structure and field layout

The IRS publishes the EO BMF at https://www.irs.gov/charities-non-profits/exempt-organizations-business-master-file-extract-eo-bmf as a set of comma-delimited CSV files updated monthly. The national file is distributed as a ZIP archive; the IRS also publishes four regional files (eo1.csv through eo4.csv) covering, respectively, the Northeast, the Mid-Atlantic and Great Lakes, the Gulf Coast and Pacific Coast, and international filers plus all other areas including Puerto Rico. Each row represents one tax-exempt organization. Key fields:

  • EIN: the nine-digit Employer Identification Number, zero-padded, no hyphens. This is the primary stable identifier for a nonprofit across all federal databases—IRS filings, USASpending.gov awards, FPDS contracts, SAM.gov registrations, and state charity registrations all link via EIN. Organization names change; EINs do not (barring rare IRS correction of duplicate assignment).
  • SUBSECTION: a two-digit code for the IRC exemption subsection. 03 = 501(c)(3), 04 = 501(c)(4), 06 = 501(c)(6), 07 = 501(c)(7). Values 01 through 29 map to IRC subsections 501(c)(1) through 501(c)(29), so code 27 denotes 501(c)(27), not a section 527 political organization.
  • FOUNDATION: a two-digit code distinguishing types within 501(c)(3). In the IRS code table, 02 and 03 = private operating foundations; 04 = private non-operating foundation; 10–14 = churches, schools, hospitals, and similar organizations that are public charities by type; 15 = publicly supported charity under § 170(b)(1)(A)(vi); 16 = publicly supported organization under § 509(a)(2); 17 = § 509(a)(3) supporting organization. Private foundations (codes 02–04) are subject to a completely different regulatory regime than public charities, covered in detail below.
  • RULING: a six-digit YYYYMM date indicating when the IRS granted tax-exempt status. This field is valuable for cohort analysis: organizations ruled in a particular year can be tracked forward to observe survival, growth, and revocation patterns. The IRS conducts periodic revocation sweeps under the Pension Protection Act (PPA) of 2006, which automatically revokes the exemption of any organization that fails to file for three consecutive years; post-PPA revocations are reflected as removed entries from subsequent BMF editions.
  • DEDUCTIBILITY: code 1 means contributions to the organization are deductible for donors as charitable contributions under IRC § 170; code 2 means they are not. All 501(c)(3) organizations should have deductibility code 1; 501(c)(4), (c)(6), and most other categories have code 2 (dues and contributions may be deductible as business expenses but not as charitable contributions).
  • ORGANIZATION: type code 1 = corporation, 2 = trust, 3 = cooperative, 4 = partnership, 5 = association. The organizational form matters for governance: corporations have boards; trusts have trustees; the distinction affects state law obligations and IRS treatment of certain transactions.
  • ASSET_CD / INCOME_CD: range codes (0–9) giving bucketed asset and income amounts. Code 0 = unknown or $0; code 9 = $50M or more. These are coarse but sufficient for stratified sampling: a researcher wanting to study mid-size nonprofits can filter to ASSET_CD = 6 or 7 ($1M–$10M) without downloading all 990 XML.
  • NTEE_CD: the National Taxonomy of Exempt Entities code, a three-character alphanumeric. The first character is the major group letter (A through Z); the following two digits specify the subdivision. An organization assigned NTEE code E22 is in major group E (Health) subdivision 22 (Hospitals). B82 is Education—Scholarships. P20 is Human Services—Multipurpose. The NTEE_CD field is populated for a majority of BMF entries but is blank or unreliable for older organizations and for religious organizations that did not pursue a standard determination process.
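The field conventions above lend themselves to a few small normalization helpers. The following is a minimal sketch, not an official decoder: the function names are hypothetical, and the dollar bounds for range codes 1 through 5 and 8 are assumptions to check against the current IRS layout document (only codes 0, 6, 7, and 9 are described in the text above).

```python
from datetime import date
from typing import Optional

# Dollar bounds per ASSET_CD / INCOME_CD bucket. Codes 0, 6, 7, and 9 follow
# the field description above; the other bounds are assumptions to verify.
RANGE_BOUNDS = {
    "0": (0, 0),
    "1": (1, 9_999),
    "2": (10_000, 24_999),
    "3": (25_000, 99_999),
    "4": (100_000, 499_999),
    "5": (500_000, 999_999),
    "6": (1_000_000, 4_999_999),
    "7": (5_000_000, 9_999_999),
    "8": (10_000_000, 49_999_999),
    "9": (50_000_000, None),  # open-ended top bucket
}


def normalize_ein(raw: str) -> str:
    """Drop hyphens and whitespace, then zero-pad to nine digits."""
    digits = "".join(ch for ch in raw if ch.isdigit())
    return digits.zfill(9)


def parse_ruling(raw: str) -> Optional[date]:
    """Convert a YYYYMM ruling string to the first day of that month."""
    raw = raw.strip()
    if len(raw) != 6 or not raw.isdigit():
        return None
    year, month = int(raw[:4]), int(raw[4:])
    if year < 1 or not 1 <= month <= 12:
        return None  # placeholder values such as 000000
    return date(year, month, 1)


def asset_range(code: str) -> Optional[tuple]:
    """Return (low, high) dollar bounds for a range code, or None if unknown."""
    return RANGE_BOUNDS.get(code.strip())


def is_midsize(asset_cd: str) -> bool:
    """True for the $1M to $10M band (codes 6 and 7) mentioned above."""
    return asset_cd.strip() in {"6", "7"}
```

Normalizing the EIN before any join matters because spreadsheet tools often strip leading zeros, which silently breaks matches against other federal datasets.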

NTEE: the nonprofit classification system

The National Taxonomy of Exempt Entities was developed by the National Center for Charitable Statistics (NCCS) at the Urban Institute in partnership with the IRS. It provides a hierarchical classification of nonprofit organizational purposes across 26 major categories designated by letters A through Z. Within each major group, two-digit numeric subdivisions identify specific activity types; the IRS further appends a letter suffix for some types to indicate organizational characteristics. The taxonomy is more granular than it appears from the major category list—the Education (B) category alone contains over 30 distinct subdivision codes covering everything from preschools (B21) to graduate and professional schools (B50) to libraries (B70) to student financial aid (B82) to educational services (B90).
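The letter-plus-two-digits structure described above can be split mechanically. This is a rough sketch with a hypothetical `parse_ntee` helper; the assumption that any trailing suffix is a single letter or digit should be checked against real BMF values before relying on it.

```python
import re
from typing import NamedTuple, Optional


class NteeCode(NamedTuple):
    major: str             # major group letter, A through Z
    subdivision: str       # two-digit subdivision, e.g. "82"
    suffix: Optional[str]  # optional trailing character, if present


# Assumed shape: one letter, two digits, optional single-character suffix.
_NTEE_PATTERN = re.compile(r"^([A-Z])(\d{2})([A-Z0-9])?$")


def parse_ntee(raw: str) -> Optional[NteeCode]:
    """Parse an NTEE code; return None for blank or malformed values."""
    match = _NTEE_PATTERN.match(raw.strip().upper())
    if match is None:
        return None
    major, subdivision, suffix = match.groups()
    return NteeCode(major, subdivision, suffix)
```

Returning `None` for values that do not fit the pattern keeps blank and malformed codes visible to the caller, so they can be counted separately from organizations that are genuinely classified.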

Major group distribution is highly skewed. Human Services (P) is typically the largest single category by organization count, followed by Religion (X) for organizations that voluntarily sought determination, Education (B), Health (E), and Arts/Culture/Humanities (A). Public/Society Benefit (W) and Philanthropy/Voluntarism (T) together cover most private foundations and community foundations. Science/Technology (U) and Social Science (V) are among the smallest major groups by organization count.

NTEE codes are assigned by the IRS based on information submitted during the exemption application process (Form 1023 for 501(c)(3)s, Form 1024 for most others). They are not always accurate—organizations that change their primary purpose do not automatically receive updated codes, and the IRS assignment process has historically applied codes inconsistently across regions and time periods. NCCS has developed a supplementary “NTEE-CC” (Core Code) system that regularizes codes and provides cleaner data for longitudinal research; this cleaned version is distributed through the NCCS data archive at Urban Institute and through Candid's GuideStar platform.

Form 990 e-file XML and the IRS AWS S3 dataset

Since 2016 the IRS has posted electronically filed Form 990, 990-EZ, and 990-PF returns, covering filings from roughly 2011 onward, to a public AWS S3 bucket at s3://irs-form-990/. The bucket requires no authentication for read access. The AWS CLI command aws s3 ls s3://irs-form-990/ --no-sign-request lists the available objects including annual index files and individual XML returns.

Annual index files at https://s3.amazonaws.com/irs-form-990/index_{year}.json list every filing in the dataset for that tax year. Each index record contains the organization EIN, name, tax period (YYYYMM), form type (990, 990EZ, 990PF), object ID, and a direct HTTPS URL to the XML filing. The 2022 index alone lists several hundred thousand filings. Across all available years (approximately 2011 through present), the dataset contains roughly four to five million individual returns.
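Working with an index file mostly means filtering its records down to the filings of interest before fetching any XML. The sketch below runs against a small in-memory sample rather than the live bucket. The top-level key and the field names (EIN, FormType, URL, and so on) are assumptions modeled on the description above, and the sample organizations and URLs are invented placeholders.

```python
import json
from typing import Any, Iterator, Optional

# Hypothetical index content: key names and values are illustrative only.
SAMPLE_INDEX_JSON = """
{
  "Filings2022": [
    {"EIN": "111111111", "OrganizationName": "EXAMPLE FOOD BANK",
     "TaxPeriod": "202112", "FormType": "990",
     "ObjectId": "000000000000000001",
     "URL": "https://example.invalid/000000000000000001_public.xml"},
    {"EIN": "222222222", "OrganizationName": "EXAMPLE FAMILY FOUNDATION",
     "TaxPeriod": "202112", "FormType": "990PF",
     "ObjectId": "000000000000000002",
     "URL": "https://example.invalid/000000000000000002_public.xml"},
    {"EIN": "333333333", "OrganizationName": "EXAMPLE YOUTH LEAGUE",
     "TaxPeriod": "202106", "FormType": "990EZ",
     "ObjectId": "000000000000000003",
     "URL": "https://example.invalid/000000000000000003_public.xml"}
  ]
}
"""


def iter_filings(index: dict[str, Any]) -> Iterator[dict[str, Any]]:
    """Yield filing records from every list-valued top-level key."""
    for value in index.values():
        if isinstance(value, list):
            yield from value


def select_filings(
    index: dict[str, Any],
    form_types: set[str],
    eins: Optional[set[str]] = None,
) -> list[dict[str, Any]]:
    """Keep records matching the form types and, optionally, an EIN allow-list."""
    return [
        rec for rec in iter_filings(index)
        if rec.get("FormType") in form_types
        and (eins is None or rec.get("EIN") in eins)
    ]


index = json.loads(SAMPLE_INDEX_JSON)
pf_urls = [rec["URL"] for rec in select_filings(index, {"990PF"})]
```

Filtering on the index first keeps the number of XML downloads proportional to the study population rather than to the full filing year.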

The Form 990 XML follows an IRS-defined schema that has evolved across tax years. The XML namespace for recent filings is http://www.irs.gov/efile; older filings use variants of this namespace with year-specific suffixes. Any production parser must handle namespace variation. The top-level structure is a Return element containing a ReturnHeader (EIN, organization name, tax year, preparer information) and a ReturnData element that holds the substantive financial data, organized into one child element per form or schedule (for example IRS990 or IRS990PF).

Key financial fields accessible in 990 XML by XPath (within the IRS efile namespace):

  • Part I summary totals: CYTotalRevenueAmt, CYTotalExpensesAmt, TotalAssetsEOYAmt, TotalLiabilitiesEOYAmt, NetAssetsOrFundBalancesEOYAmt (names as used in recent schema versions; older versions use different element names). These provide a financial snapshot sufficient for most screening.
  • Part VII officer compensation: PersonNm, TitleTxt, ReportableCompFromOrgAmt (reportable W-2 or 1099 compensation from the organization), OtherCompensationAmt (other compensation such as deferred compensation and nontaxable benefits). The five highest-paid employees earning over $100,000, plus all officers and directors at any compensation level, must be listed. For a major hospital system, this section alone may list dozens of executives.
  • Part IX functional expenses: the three-way split between program services, management and general, and fundraising (in recent schema versions, the ProgramServicesAmt, ManagementAndGeneralAmt, and FundraisingAmt children of TotalFunctionalExpensesGrp). The program expense ratio (program services divided by total expenses) is the primary efficiency metric used by charity watchdog organizations.
  • Schedule C political and lobbying activity: PoliticalCampaignActyInd (yes/no flag for any participation in political campaign activity), LobbyingActivitiesInd, DirectLobbyingExpenses. For 501(c)(3) organizations, any “yes” response on political campaign activity is a major red flag.
  • Schedule L related-party transactions: loans to or from officers, directors, or key employees; business transactions with interested persons. This schedule surfaces self-dealing that warrants IRS attention.
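One way to cope with the namespace variation noted earlier is to match on local element names and ignore the namespace entirely. The sketch below does that against a tiny hand-written sample, not a real filing. The element names under TotalFunctionalExpensesGrp are assumptions based on recent schema versions, and `find_text` and `program_expense_ratio` are hypothetical helpers.

```python
import xml.etree.ElementTree as ET
from typing import Optional

# Hand-written sample for illustration; real returns are far larger.
SAMPLE_XML = """<Return xmlns="http://www.irs.gov/efile">
  <ReturnHeader><Filer><EIN>123456789</EIN></Filer></ReturnHeader>
  <ReturnData>
    <IRS990>
      <CYTotalRevenueAmt>1000000</CYTotalRevenueAmt>
      <CYTotalExpensesAmt>900000</CYTotalExpensesAmt>
      <TotalFunctionalExpensesGrp>
        <TotalAmt>900000</TotalAmt>
        <ProgramServicesAmt>720000</ProgramServicesAmt>
      </TotalFunctionalExpensesGrp>
    </IRS990>
  </ReturnData>
</Return>"""


def local_name(tag: str) -> str:
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def find_text(root: ET.Element, *names: str) -> Optional[str]:
    """Find names[0] at any depth, then descend through the remaining
    names as direct children; return the stripped text or None."""
    for start in root.iter():
        if local_name(start.tag) != names[0]:
            continue
        node = start
        for name in names[1:]:
            node = next((c for c in node if local_name(c.tag) == name), None)
            if node is None:
                break
        else:
            return (node.text or "").strip() or None
    return None


def program_expense_ratio(root: ET.Element) -> Optional[float]:
    """Program services divided by total functional expenses."""
    program = find_text(root, "TotalFunctionalExpensesGrp", "ProgramServicesAmt")
    total = find_text(root, "TotalFunctionalExpensesGrp", "TotalAmt")
    if program is None or total is None or float(total) == 0:
        return None
    return float(program) / float(total)


root = ET.fromstring(SAMPLE_XML)
revenue = find_text(root, "CYTotalRevenueAmt")
ratio = program_expense_ratio(root)
```

Matching on local names trades strictness for robustness: the same code reads filings from different schema years, but it will not notice if an element with the same name appears in an unexpected part of the document.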

The Form 990-PF (private foundation return) has a different schema from the standard 990. It is required for all private foundations regardless of size; even a small family foundation with $500,000 in assets must file. Part I of the 990-PF reports revenue and expenses, including investment income; Part VI computes the 1.39 percent excise tax on net investment income (a flat rate that replaced the earlier 2 percent and 1 percent tiers under 2019 legislation); and the Supplementary Information part (Part XV on older form revisions) lists all grants and contributions paid during the year, which is the only public record of where foundation grant dollars actually go. For major foundations like the Bill & Melinda Gates Foundation ($70B+ assets), Ford Foundation ($16B), Robert Wood Johnson Foundation ($13B), and Bloomberg Philanthropies, the 990-PF grant list is the primary public disclosure of philanthropic strategy.

ProPublica Nonprofit Explorer and the search API

ProPublica's Nonprofit Explorer at projects.propublica.org/nonprofits provides a public-facing search interface over parsed 990 data, and more importantly exposes a JSON API that developers can query programmatically without the overhead of downloading and parsing raw IRS XML. The API base URL is https://projects.propublica.org/nonprofits/api/v2/.

The organization detail endpoint at /organizations/{ein}.json returns a comprehensive object including: EIN, organization name, city, state, NTEE code, subsection code, classification codes, ruling date, deductibility code, foundation code, tax period, filing requirements, asset amount, income amount, Form 990 revenue amount, number of employees, total revenue, total expenses, net income, total assets, and a filing history array with a URL to each 990 PDF and XML for all available years. The filing history is one of ProPublica's most useful features: it provides direct links to each year's return without requiring the caller to know the S3 object key structure.

The search endpoint at /search.json accepts query parameters including q (text search against organization name), state[id] (two-letter state code), ntee[id] (NTEE major group letter), c_code[id] (subsection code, e.g., 3 for 501(c)(3)), and sort options including revenue and name. Rate limits are 5,000 requests per day with an API key. The ProPublica API does not require authentication for basic use, though heavy use should be accompanied by a free API key registration.
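Because the search parameters contain square brackets, it is safer to let a URL library do the encoding than to build query strings by hand. The sketch below only constructs URLs and makes no network calls. `search_url` and `organization_url` are hypothetical helper names, and only parameters named in the text above are used.

```python
from typing import Optional
from urllib.parse import urlencode

API_BASE = "https://projects.propublica.org/nonprofits/api/v2"


def organization_url(ein: str) -> str:
    """URL of the organization detail endpoint for an EIN (hyphens allowed)."""
    digits = "".join(ch for ch in ein if ch.isdigit())
    if not digits:
        raise ValueError("EIN must contain digits")
    return f"{API_BASE}/organizations/{digits}.json"


def search_url(
    q: Optional[str] = None,
    state: Optional[str] = None,
    c_code: Optional[int] = None,
) -> str:
    """URL of the search endpoint, including only the filters provided."""
    params = {}
    if q:
        params["q"] = q
    if state:
        params["state[id]"] = state.upper()
    if c_code is not None:
        params["c_code[id]"] = str(c_code)
    query = urlencode(params)  # percent-encodes the brackets and spaces
    return f"{API_BASE}/search.json" + (f"?{query}" if query else "")


example = search_url(q="food bank", state="ny", c_code=3)
```

A caller would pass the resulting URL to an HTTP client such as `requests.get`, pausing between calls to stay within whatever rate limit applies.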

Candid (formerly GuideStar) maintains a premium data platform covering 990 data with additional enrichments including organization profiles, leadership contacts, and validation against state charity registrations. Candid's data is the source relied upon by most commercial due diligence systems and grant management platforms. The Urban Institute's NCCS Data Archive provides cleaned panel datasets suitable for academic longitudinal research, including harmonized field names across form versions and tax years.

Schedule A and the public support test

IRC § 509(a) divides 501(c)(3) organizations into two fundamental categories—public charities and private foundations—based on the breadth of their financial support. A 501(c)(3) is presumed to be a private foundation unless it can demonstrate it meets one of the public support tests under § 509(a)(1) or § 509(a)(2). Schedule A of Form 990 documents this determination.

Under the § 509(a)(1) test (the most common), a public charity must demonstrate over a rolling five-year computation period that it receives at least one-third (33⅓ percent) of its total support from the general public, with government grants and contributions from the public counted together. Contributions from any single donor count toward public support only up to 2 percent of total support, so the excess from a large donor is excluded from the numerator and a single major donor cannot by itself satisfy the test.

Under the § 509(a)(2) test, used primarily by membership organizations and social service agencies that earn fees for services, more than one-third of support must come from fees for exempt function services plus government and public contributions, and investment income must not exceed one-third of total support.

An organization that fails the public support test in a given year may still qualify under a “facts and circumstances” exception if it attracts at least 10 percent public support and can demonstrate other indicia of public character. An organization that fails entirely reclassifies as a private foundation and becomes subject to the full private foundation excise tax regime.
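The effect of the 2 percent cap in the § 509(a)(1) test is easiest to see with numbers. The following is a simplified sketch with invented figures. It treats all inputs as five-year totals and leaves out refinements in the actual Schedule A computation, such as donors exempt from the cap and unusual grants. The function names are hypothetical.

```python
def public_support_fraction(
    donor_totals: dict[str, float],
    government_grants: float,
    investment_income: float,
) -> float:
    """Public support divided by total support, with each donor's
    contribution capped at 2 percent of total support."""
    total_support = sum(donor_totals.values()) + government_grants + investment_income
    if total_support <= 0:
        raise ValueError("total support must be positive")
    cap = 0.02 * total_support
    capped_gifts = sum(min(amount, cap) for amount in donor_totals.values())
    return (capped_gifts + government_grants) / total_support


def classify(fraction: float) -> str:
    """Map a public support fraction to the outcomes described above."""
    if fraction >= 1 / 3:
        return "meets one-third test"
    if fraction >= 0.10:
        return "may qualify under facts and circumstances"
    return "fails public support test"


# Invented example: one donor gives $600,000; forty others give $5,000 each.
donors = {"major_donor": 600_000.0}
donors.update({f"small_donor_{i}": 5_000.0 for i in range(40)})

fraction = public_support_fraction(
    donors, government_grants=100_000.0, investment_income=100_000.0
)
outcome = classify(fraction)
```

In this example total support is $1,000,000, so the cap is $20,000 per donor. The major donor's $600,000 counts as only $20,000, which pulls the fraction down to 32 percent even though contributions and grants make up 90 percent of the organization's support.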

Private foundations: the excise tax regime

Approximately 100,000 organizations in the BMF carry a private foundation code in the FOUNDATION field (02 or 03 for operating foundations, 04 for non-operating foundations). These are typically funded by a single donor, family, or corporation rather than by broad public support. Congress subjected private foundations to a strict excise tax regime in the Tax Reform Act of 1969 in response to abuses by foundations used primarily to maintain family control of businesses or to engage in self-dealing.

The current excise tax structure for private foundations includes six major categories of prohibited transactions and taxes:

  • Net investment income tax (IRC § 4940): a 1.39 percent excise tax on net investment income (interest, dividends, capital gains, rents, and royalties less investment expenses). All private foundations pay this tax annually on Form 990-PF Part VI regardless of their grant-making activity. Before 2020 a two-tiered rate (2 percent standard, 1 percent if distributions exceeded historical average) applied; the Taxpayer Certainty and Disaster Tax Relief Act of 2019 simplified this to a flat 1.39 percent.
  • Self-dealing prohibitions (IRC § 4941): absolute prohibition on any financial transactions between a private foundation and its disqualified persons (founders, substantial contributors, foundation managers, government officials, and their family members and controlled businesses). Covered transactions include loans, sales, leases, compensation arrangements not approved as reasonable, and transfers of assets. Penalties are severe: an initial excise tax of 10 percent of the transaction amount on the disqualified person, plus 5 percent on the foundation manager who approved it; if the violation is not corrected, additional taxes of 200 percent and 50 percent respectively apply.
  • Mandatory distribution (IRC § 4942): a private foundation must distribute at least 5 percent of its net asset value (measured at fair market value, averaged over the year) annually for charitable purposes. Failure to meet the distributable amount triggers a 30 percent excise tax on the shortfall. “Qualifying distributions” include grants to public charities, reasonable administrative expenses, and direct charitable expenditures; program-related investments (PRIs) such as below-market loans to charitable projects also count. The 5 percent minimum distribution rule is one of the most consequential structural features of US philanthropic law—it forces asset liquidation and charitable deployment at a scale that shapes the entire US grant economy.
  • Excess business holdings (IRC § 4943): a private foundation and its disqualified persons together cannot hold more than 20 percent of the voting stock of a business enterprise, reduced to 2 percent if any disqualified person holds more than 20 percent. This rule was designed to prevent foundations from being used as vehicles for maintaining control of family businesses. Five-year divestiture grace periods apply to holdings acquired by gift or bequest.
  • Jeopardizing investments (IRC § 4944): a private foundation may not make investments that jeopardize its charitable purpose—speculative ventures that could impair the foundation's ability to carry out its exempt purposes. The initial excise tax is 10 percent of the investment amount on the foundation; 10 percent on the foundation manager who approved it.
  • Taxable expenditures (IRC § 4945): expenditures for lobbying, political campaign activity, grants to individuals without prior IRS approval of the grant procedure, grants to non-public-charity organizations without “expenditure responsibility” (ongoing oversight of how funds are used), and any other non-charitable purpose. The initial excise tax is 20 percent of the expenditure on the foundation, plus 5 percent on a foundation manager who knowingly approved it, with higher second-tier taxes if the expenditure is not corrected.
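The rates in this list turn into simple arithmetic once a foundation's asset base and investment income are known. The sketch below is a deliberately simplified illustration using only the 1.39 percent, 5 percent, and 30 percent figures quoted above. It ignores carryovers, timing rules, and the adjustments that go into the actual distributable amount on Form 990-PF. The function name and the dollar amounts are hypothetical.

```python
from decimal import Decimal

NII_TAX_RATE = Decimal("0.0139")      # § 4940 tax on net investment income
MIN_PAYOUT_RATE = Decimal("0.05")     # § 4942 minimum distribution
SHORTFALL_TAX_RATE = Decimal("0.30")  # § 4942 initial tax on the shortfall


def foundation_obligations(
    avg_fmv_assets: Decimal,
    net_investment_income: Decimal,
    qualifying_distributions: Decimal,
) -> dict[str, Decimal]:
    """Rough annual obligations for a private foundation (simplified)."""
    required = MIN_PAYOUT_RATE * avg_fmv_assets
    shortfall = max(Decimal("0"), required - qualifying_distributions)
    return {
        "nii_excise_tax": NII_TAX_RATE * net_investment_income,
        "required_distribution": required,
        "shortfall": shortfall,
        "shortfall_tax": SHORTFALL_TAX_RATE * shortfall,
    }


# Invented figures: $10M in assets, $500K of net investment income,
# and $400K of qualifying distributions during the year.
example = foundation_obligations(
    Decimal("10000000"), Decimal("500000"), Decimal("400000")
)
```

With these inputs the required distribution is $500,000, so distributing $400,000 leaves a $100,000 shortfall and a $30,000 initial tax, on top of $6,950 in investment income tax. `Decimal` is used instead of floats so the dollar amounts stay exact.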

The entire private foundation regime is visible in public data: the 990-PF discloses the foundation's assets at fair market value, its net investment income, excise tax paid, distributable amount, actual qualifying distributions, grants paid (by recipient organization and purpose), and any excise tax corrections from prior years. This makes the Gates Foundation, Ford Foundation, MacArthur Foundation, and every other private foundation's grant strategy a matter of public record available in the IRS XML dataset.

501(c)(3) vs. 501(c)(4): political activity rules

The most consequential distinction in nonprofit law for purposes of political transparency is between 501(c)(3) public charities and 501(c)(4) social welfare organizations. These two categories sit adjacent in the IRC but operate under radically different political activity constraints and donor disclosure requirements.

A 501(c)(3) organization is absolutely prohibited from participating in any political campaign activity on behalf of or in opposition to any candidate for public office under IRC § 501(c)(3)'s direct statutory language. This prohibition is categorical: endorsing a candidate, making a financial contribution to a candidate, or using organizational resources to support or oppose a candidacy can trigger loss of exempt status and excise taxes under IRC § 4955 (an initial tax of 10 percent of the expenditure on the organization; if the expenditure is not corrected, an additional 100 percent tax applies). The IRS enforces this through audit, though historically enforcement has been uneven. A 501(c)(3) may engage in non-partisan voter education, candidate forums, and advocacy on legislative issues, but never in activities that favor or oppose a specific candidate.

Lobbying by 501(c)(3) organizations is permitted but regulated. Organizations may elect to use the “H election” under IRC § 501(h), which provides bright-line dollar limits on lobbying expenditures: a sliding scale that starts at 20 percent of the first $500,000 of exempt purpose expenditures and is capped at $1 million in total, with a sub-limit of 25 percent of the total for grassroots lobbying of the public (as opposed to direct lobbying of legislators). Organizations not making the H election are subject to the older “substantial part” test, under which no substantial part of an organization's activities may constitute attempting to influence legislation. The vagueness of the substantial part test makes it riskier in practice.
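The § 501(h) dollar limits follow a sliding scale, which a short function makes concrete. This sketch uses the bracket schedule from IRC § 4911 as commonly summarized (20 percent of the first $500,000 of exempt purpose expenditures, 15 percent of the next $500,000, 10 percent of the next $500,000, and 5 percent of the remainder, capped at $1 million), with the grassroots ceiling set at one quarter of the total. The function name is hypothetical, and the figures should be confirmed against the current statute before use.

```python
# (bracket width in dollars, rate in whole percent)
BRACKETS = [(500_000, 20), (500_000, 15), (500_000, 10)]
REMAINDER_RATE_PCT = 5
OVERALL_CAP = 1_000_000


def lobbying_limits(exempt_purpose_expenditures: int) -> tuple[int, int]:
    """Return (total lobbying limit, grassroots sub-limit) in whole dollars."""
    if exempt_purpose_expenditures < 0:
        raise ValueError("expenditures cannot be negative")
    remaining = exempt_purpose_expenditures
    total = 0
    for width, rate_pct in BRACKETS:
        portion = min(remaining, width)
        total += portion * rate_pct // 100
        remaining -= portion
    total += remaining * REMAINDER_RATE_PCT // 100
    total = min(total, OVERALL_CAP)
    grassroots = total // 4  # 25 percent of the total limit
    return total, grassroots


small_org = lobbying_limits(400_000)
mid_org = lobbying_limits(2_000_000)
large_org = lobbying_limits(100_000_000)
```

An organization spending $2,000,000 on exempt purposes gets a $250,000 overall limit and a $62,500 grassroots ceiling. The $1 million cap binds once exempt purpose expenditures reach $17 million, which is why very large charities often find the election restrictive.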

A 501(c)(4) social welfare organization operates under the standard that its primary purpose must be the promotion of the common good and general welfare of the community, but political activity may be a secondary activity so long as it is not the organization's primary purpose. The IRS historically interpreted “primary purpose” as meaning more than 50 percent of activities; in practice, major 501(c)(4) political spending organizations have operated well into the majority-political-activity range while maintaining (c)(4) status. Post-Citizens United, 501(c)(4) organizations became the dominant vehicle for unlimited corporate and individual political spending because (a) they are not required to register with the FEC as political committees so long as political activity is not their primary purpose, (b) their donors are not publicly disclosed in IRS filings (Schedule B names are filed but not made public), and (c) corporate treasury funds can be used. Examples include the Sierra Club (environmental advocacy and political endorsements; the Sierra Club Foundation is its separate 501(c)(3) affiliate), Americans for Prosperity (Koch network), Planned Parenthood Action Fund, and the NAACP (as distinct from the NAACP Legal Defense Fund, which is a separate 501(c)(3)).

The 2013 IRS “targeting controversy,” commonly associated with IRS official Lois Lerner, involved the use of Be On the Lookout (BOLO) lists that disproportionately flagged applications from Tea Party-affiliated 501(c)(4) organizations for additional scrutiny. The Treasury Inspector General for Tax Administration (TIGTA) documented the improper use of political criteria in the exemption application process. Congress responded in the Protecting Americans from Tax Hikes (PATH) Act of 2015, which among other provisions required new 501(c)(4) organizations to file a Form 8976 notice within 60 days of formation. This was the first formal notice requirement for (c)(4)s, which previously could operate without any IRS contact prior to filing their first 990.

The 527 political organization category provides an explicit home for entities whose primary purpose is political. 527 organizations must report their donors and expenditures to either the FEC (if they make independent expenditures in federal elections) or the IRS (if they operate solely at state and local levels). Unlike 501(c)(4) donors, 527 donors are publicly disclosed. Major 527 organizations include the national party committees, most state party organizations, some leadership PACs, and the older class of “527 groups” (like Swift Boat Veterans for Truth) that emerged before the FEC clarified its jurisdiction over such entities.

Schedule B and the donor disclosure debate

Schedule B of Form 990, the Schedule of Contributors, requires organizations to list every donor who contributed more than the greater of $5,000 or 2 percent of total receipts during the tax year, along with the donor's name, address, and the amount contributed. Schedule B is filed with the IRS as part of the official 990 submission, but donor names and addresses are not included in the public inspection copy of the return required under IRC § 6104. The IRS redacts this information from copies provided to the public or made available through its Tax Exempt Organization Search (TEOS) tool.

The confidentiality of Schedule B has been the subject of sustained litigation. The fundamental tension is between donor privacy (the First Amendment right of association, which the Supreme Court in NAACP v. Alabama (1958) recognized as protecting donor identities from disclosure when disclosure would chill protected association) and public accountability for organizations receiving significant tax subsidies. For 501(c)(3) charities, the balance has traditionally favored privacy. For 501(c)(4) organizations engaged in substantial political activity, critics argue that Schedule B confidentiality enables anonymous political spending that undermines democratic accountability.

California's Attorney General historically required charities soliciting donations in California, including 501(c)(3) organizations such as the Americans for Prosperity Foundation, to file an unredacted Schedule B with the state registry. The Supreme Court struck down this requirement in Americans for Prosperity Foundation v. Bonta (2021), holding it facially unconstitutional under the First Amendment because the state could not demonstrate that its across-the-board disclosure requirement was narrowly tailored to its interest in investigating charitable misconduct. The decision makes state-level Schedule B disclosure requirements constitutionally precarious going forward.

Accessing the data programmatically

The IRS EO BMF is available at the URL above in ZIP format (national) or as individual regional CSV files eo1.csv through eo4.csv. No API key or registration is required. The files are updated monthly; researchers building databases should implement a differential ingestion process comparing the current month's BMF against prior snapshots to detect new organizations, revocations, and address changes. The IRS does not publish a delta file; full snapshots must be compared.
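Since no delta file is published, a differential ingest comes down to keying two snapshots by EIN and comparing them. The sketch below shows the idea on two tiny invented snapshots held in memory. The helper names and the choice of watched fields are assumptions, and a production version would stream the files rather than load them whole.

```python
import csv
import io

# Fields whose changes we want to detect (an illustrative choice).
WATCHED_FIELDS = ("NAME", "STREET", "CITY", "STATE", "ZIP")


def index_by_ein(csv_text: str) -> dict[str, dict[str, str]]:
    """Parse a BMF-style CSV with a header row and key each row by EIN."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["EIN"].strip().zfill(9): row for row in reader}


def diff_snapshots(
    prev: dict[str, dict[str, str]],
    curr: dict[str, dict[str, str]],
) -> dict[str, list[str]]:
    """Return EINs that were added, removed, or changed between snapshots."""
    prev_eins, curr_eins = set(prev), set(curr)
    changed = sorted(
        ein for ein in prev_eins & curr_eins
        if any(
            (prev[ein].get(f) or "").strip() != (curr[ein].get(f) or "").strip()
            for f in WATCHED_FIELDS
        )
    )
    return {
        "added": sorted(curr_eins - prev_eins),
        "removed": sorted(prev_eins - curr_eins),
        "changed": changed,
    }


# Invented snapshots for illustration.
PREV = """EIN,NAME,STREET,CITY,STATE,ZIP
000000001,EXAMPLE ARTS COUNCIL,1 MAIN ST,ALBANY,NY,12207
000000002,EXAMPLE FOOD PANTRY,2 ELM ST,TOLEDO,OH,43604
"""
CURR = """EIN,NAME,STREET,CITY,STATE,ZIP
000000002,EXAMPLE FOOD PANTRY,9 OAK AVE,TOLEDO,OH,43604
000000003,EXAMPLE YOUTH CHOIR,3 PINE RD,AUSTIN,TX,78701
"""

changes = diff_snapshots(index_by_ein(PREV), index_by_ein(CURR))
```

An EIN in the "removed" list only shows that the organization left the file. Whether that reflects a revocation, a merger, or a correction has to be confirmed against other IRS sources.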

For Form 990 XML, the AWS S3 bucket s3://irs-form-990/ is accessible with the --no-sign-request flag in the AWS CLI, meaning no AWS credentials are required. The annual index files at https://s3.amazonaws.com/irs-form-990/index_{year}.json are HTTPS-accessible without any authentication. Individual return XMLs are referenced by the URL field in each index record and can be fetched with a standard HTTP client. The total size of all XML filings across all years runs to tens of terabytes; researchers working with the full corpus typically process it on EC2 instances in the same AWS region as the bucket to avoid data transfer (egress) charges.

For most analytical purposes, the ProPublica API is substantially faster than raw S3 XML for retrieving summary financials on a known EIN list. For full-text analysis of specific schedules (executive compensation detail, grant lists on 990-PF, related-party transactions on Schedule L), direct XML parsing from S3 is necessary. Candid's paid API provides the most complete data with the highest data quality, but the IRS and ProPublica data are sufficient for most academic, journalistic, and compliance use cases.

Python example: BMF subsector count and ProPublica top-charity lookup

The following script performs three analyses. First, it downloads the IRS EO BMF national file, filters to active 501(c)(3) organizations, and counts them by NTEE major category to produce a subsector concentration table. Second, it queries the ProPublica Nonprofit Explorer API for the top-10 largest public charities by Form 990 revenue. Third, it produces an asset distribution table for Human Services (NTEE major group P), the largest 501(c)(3) subsector by organization count, using the asset range codes in the BMF.

import requests
import pandas as pd
import csv
import io
import time
from collections import defaultdict
from typing import Any

# ---------------------------------------------------------------------------
# IRS Exempt Organizations: BMF + ProPublica API Analysis
#
# Part 1: Download the IRS EO BMF national file and count active 501(c)(3)
#         organizations by NTEE major category (A through Z).
# Part 2: Fetch the top-10 largest public charities by Form 990 revenue
#         from the ProPublica Nonprofit Explorer API.
# Part 3: Print a formatted subsector concentration table.
#
# IRS BMF national file (comma-delimited CSV, updated monthly):
#   https://www.irs.gov/charities-non-profits/exempt-organizations-business-master-file-extract-eo-bmf
# ProPublica API (no key required for basic search):
#   https://projects.propublica.org/nonprofits/api/v2/
# ---------------------------------------------------------------------------

BMF_URL = (
    "https://apps.irs.gov/pub/epostcard/data-download-eobmf.zip"
)
# Fallback: the IRS also publishes four regional CSV files, eo1.csv through eo4.csv.
# Column layout (comma-delimited; recent files include a header row, older files may not):
#   EIN, NAME, ICO, STREET, CITY, STATE, ZIP, GROUP, SUBSECTION, AFFILIATION,
#   CLASSIFICATION, RULING, DEDUCTIBILITY, FOUNDATION, ACTIVITY, ORGANIZATION,
#   STATUS, TAX_PERIOD, ASSET_CD, INCOME_CD, FILING_REQ_CD, PF_FILING_REQ_CD,
#   ACCT_PD, ASSET_AMT, INCOME_AMT, F990_REV_AMT, NTEE_CD

PROPUBLICA_SEARCH = "https://projects.propublica.org/nonprofits/api/v2/search.json"

# NTEE major category names (26 groups, A through Z)
NTEE_LABELS: dict[str, str] = {
    "A": "Arts, Culture, Humanities",
    "B": "Education",
    "C": "Environment",
    "D": "Animal Services",
    "E": "Health",
    "F": "Mental Health",
    "G": "Disease / Disorder Research",
    "H": "Medical Research",
    "I": "Crime & Legal",
    "J": "Employment",
    "K": "Food, Agriculture & Nutrition",
    "L": "Housing & Shelter",
    "M": "Public Safety, Disaster",
    "N": "Recreation & Sports",
    "O": "Youth Development",
    "P": "Human Services",
    "Q": "International / Foreign Affairs",
    "R": "Civil Rights / Advocacy",
    "S": "Community Improvement",
    "T": "Philanthropy / Voluntarism",
    "U": "Science & Technology",
    "V": "Social Science Research",
    "W": "Public / Society Benefit",
    "X": "Religion",
    "Y": "Mutual / Membership Benefit",
    "Z": "Unknown / Unclassified",
}


def fetch_bmf_national() -> list[dict[str, str]]:
    """
    Download the IRS EO BMF national CSV (tab-delimited) and return a list of
    row dicts. Filters to SUBSECTION == '03' (501c3) and STATUS == '1' (active).

    The BMF is published as a ZIP archive containing a single CSV. Column names
    are on the first row. EIN is a 9-digit string (no hyphens). NTEE_CD is a
    3-character code like 'E22' (major group letter + 2-digit subdivision).
    """
    print("Downloading IRS EO BMF national file...")
    resp = requests.get(BMF_URL, timeout=180)
    resp.raise_for_status()

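    # Local imports keep this function self-contained; binding `io` here means
    # both BytesIO and StringIO resolve regardless of the module-level imports.
    import io
    import zipfile

    zf = zipfile.ZipFile(io.BytesIO(resp.content))
    csv_names = [n for n in zf.namelist() if n.lower().endswith(".csv")]
    if not csv_names:
        raise RuntimeError("No CSV member found in the BMF ZIP archive")
    raw = zf.read(csv_names[0]).decode("latin-1")

    reader = csv.DictReader(io.StringIO(raw), delimiter=",")
    rows = []
    for row in reader:
        # STATUS 1 = exempt; SUBSECTION 03 = 501(c)(3).
        # Leading zeros are stripped so '03'/'3' and '01'/'1' both match.
        subsection = (row.get("SUBSECTION") or "").strip().lstrip("0")
        status = (row.get("STATUS") or "").strip().lstrip("0")
        if subsection == "3" and status == "1":
            rows.append(row)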
    print(f"Active 501(c)(3) organizations in BMF: {len(rows):,}")
    return rows


def count_by_ntee_major(rows: list[dict[str, str]]) -> pd.DataFrame:
    """
    Extract the NTEE major category letter (first character of NTEE_CD) and
    count organizations per major group. Rows with blank NTEE_CD fall into 'Z'.
    """
    counts: dict[str, int] = defaultdict(int)
    for row in rows:
        ntee = row.get("NTEE_CD", "").strip()
        major = ntee[0].upper() if ntee else "Z"
        if major not in NTEE_LABELS:
            major = "Z"
        counts[major] += 1

    records = [
        {
            "major": k,
            "label": NTEE_LABELS[k],
            "count": counts.get(k, 0),
        }
        for k in sorted(NTEE_LABELS.keys())
    ]
    df = pd.DataFrame(records)
    df["pct"] = 100.0 * df["count"] / df["count"].sum()
    return df.sort_values("count", ascending=False).reset_index(drop=True)


def fetch_top_public_charities(n: int = 10) -> list[dict[str, Any]]:
    """
    Query the ProPublica Nonprofit Explorer search API for 501(c)(3)
    organizations (c_code[id]=3) and return at most n of them. Private
    foundations (foundation_code 15) are not filtered out by this request.
    The order, sort_order, and per_page parameters are passed on the
    assumption that the server honors them; if it ignores them, results come
    back in its default order and page size, and the summary records may not
    carry a revenue field. Rate limit: 5,000 requests/day; no API key required
    for search.
    """
    print("Fetching top public charities from ProPublica API...")
    params = {
        "c_code[id]": "3",       # 501(c)(3)
        "order": "revenue",      # assumed sort key; may be ignored by the API
        "sort_order": "desc",
        "per_page": n,
        "page": 0,               # pages are zero-indexed
    }
    resp = requests.get(PROPUBLICA_SEARCH, params=params, timeout=30)
    resp.raise_for_status()
    data = resp.json()
    orgs = data.get("organizations", [])
    return orgs[:n]              # enforce the requested limit client-side


def print_ntee_table(df: pd.DataFrame) -> None:
    total = df["count"].sum()
    print()
    print(f"Active 501(c)(3) Organizations by NTEE Major Category  (n={total:,})")
    print()
    print(f"  {'#':>3}  {'Cat':>3}  {'Label':<38}  {'Count':>8}  {'Share':>7}")
    print("  " + "-" * 68)
    for i, row in df.iterrows():
        print(
            f"  {i+1:>3}  {row['major']:>3}  {row['label']:<38}"
            f"  {int(row['count']):>8,}  {row['pct']:>6.1f}%"
        )
    print()


def print_top_charities(orgs: list[dict[str, Any]]) -> None:
    print("Top Public Charities by Form 990 Revenue (ProPublica)")
    print()
    print(f"  {'#':>3}  {'EIN':>12}  {'Name':<45}  {'Revenue ($M)':>13}")
    print("  " + "-" * 80)
    for i, org in enumerate(orgs, 1):
        name = (org.get("name") or "")[:45]
        ein = org.get("ein") or ""
        revenue = org.get("form990_revenue_amount") or 0
        try:
            rev_m = float(revenue) / 1_000_000
        except (TypeError, ValueError):
            rev_m = 0.0
        print(f"  {i:>3}  {ein:>12}  {name:<45}  ${rev_m:>11.1f}M")
    print()


def main() -> None:
    # --- Part 1: BMF subsector count ---
    bmf_rows = fetch_bmf_national()
    ntee_df = count_by_ntee_major(bmf_rows)
    print_ntee_table(ntee_df)

    # --- Part 2: ProPublica top charities ---
    time.sleep(1)  # polite pause before API call
    top_orgs = fetch_top_public_charities(n=10)
    print_top_charities(top_orgs)

    # --- Part 3: Cross-tabulate BMF asset ranges for Human Services (P) ---
    # ASSET_CD codes: 0=unknown, 1=<$10k, 2=$10k-$25k, 3=$25k-$100k,
    #                 4=$100k-$500k, 5=$500k-$1M, 6=$1M-$5M, 7=$5M-$10M,
    #                 8=$10M-$50M, 9=$50M+
    ASSET_LABELS = {
        "0": "Unknown", "1": "<$10k", "2": "$10k-$25k", "3": "$25k-$100k",
        "4": "$100k-$500k", "5": "$500k-$1M", "6": "$1M-$5M",
        "7": "$5M-$10M", "8": "$10M-$50M", "9": "$50M+",
    }
    p_rows = [
        r for r in bmf_rows
        if (r.get("NTEE_CD") or "").strip().upper().startswith("P")
    ]
    asset_counts: dict[str, int] = defaultdict(int)
    for r in p_rows:
        # DictReader yields None for missing trailing fields, so guard before strip().
        cd = (r.get("ASSET_CD") or "").strip() or "0"
        asset_counts[cd] += 1

    print(f"Human Services (NTEE P) asset distribution  (n={len(p_rows):,})")
    print()
    print(f"  {'Asset Range':<20}  {'Count':>8}  {'Share':>7}")
    print("  " + "-" * 42)
    for cd in sorted(ASSET_LABELS.keys()):
        ct = asset_counts.get(cd, 0)
        pct = 100.0 * ct / len(p_rows) if p_rows else 0
        print(f"  {ASSET_LABELS[cd]:<20}  {ct:>8,}  {pct:>6.1f}%")
    print()


if __name__ == "__main__":
    main()
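
If the delimiter or the zero-padding of coded fields in a particular BMF extract is in doubt, a more defensive reader may help. The following is a minimal sketch, not the IRS's documented format: the helper names (detect_delimiter, normalize_code, read_bmf_text, is_active_501c3) are hypothetical, the sample rows are invented for illustration, and a header row is assumed to be present.

```python
import csv
import io
from typing import Optional


def detect_delimiter(header_line: str) -> str:
    """Guess the delimiter by comparing tab and comma counts in the header."""
    return "\t" if header_line.count("\t") > header_line.count(",") else ","


def normalize_code(value: Optional[str]) -> str:
    """Strip whitespace and leading zeros so '03', ' 3 ' and '3' compare equal."""
    cleaned = (value or "").strip().lstrip("0")
    return cleaned or "0"


def read_bmf_text(raw: str) -> list[dict[str, str]]:
    """Parse BMF-style text (header row assumed) into a list of row dicts."""
    first_line = raw.splitlines()[0] if raw else ""
    reader = csv.DictReader(io.StringIO(raw), delimiter=detect_delimiter(first_line))
    return list(reader)


def is_active_501c3(row: dict[str, str]) -> bool:
    """True when SUBSECTION normalizes to 3 and STATUS normalizes to 1."""
    return (
        normalize_code(row.get("SUBSECTION")) == "3"
        and normalize_code(row.get("STATUS")) == "1"
    )


# Invented sample data: one 501(c)(3) record and one 501(c)(7) record.
sample = (
    "EIN,NAME,SUBSECTION,STATUS\n"
    "010000001,EXAMPLE FOOD BANK,03,01\n"
    "010000002,EXAMPLE SOCIAL CLUB,07,01\n"
)
active = [r for r in read_bmf_text(sample) if is_active_501c3(r)]
print(len(active), "active 501(c)(3) record(s) in the sample")
```

Counting delimiters in the header line is a deliberately simple heuristic; it avoids depending on csv.Sniffer, which can misfire on short samples.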

In the script's output, Human Services (P) should appear as the largest 501(c)(3) category by raw count, typically 15–20 percent of all active organizations. Education (B) and Religion (X) each contribute roughly 10–15 percent; the Religion count covers only organizations that sought an IRS determination, since churches are not required to apply for recognition of exemption. Records with a blank or unrecognized NTEE_CD are counted under Z, so that bucket reflects missing classifications as well as genuinely unclassified organizations. Arts (A) is disproportionately large relative to its economic scale because small arts organizations are common and tend to pursue 501(c)(3) status for grant eligibility. The asset distribution for Human Services should show most registered organizations in the lower asset ranges (codes 1–4, under $500,000), reflecting the many small community service organizations, while the sector's aggregate financial weight is concentrated in a small number of large organizations in codes 8–9 ($10M and above).
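
The split between small and large organizations can be summarized directly from the ASSET_CD field. The sketch below is illustrative only: asset_band_shares is a hypothetical helper, the band boundaries follow the code ranges named above (1–4 and 8–9), and the sample rows are invented rather than drawn from the BMF.

```python
from collections import Counter
from typing import Optional

SMALL_CODES = {"1", "2", "3", "4"}   # under $500,000, per the ranges above
LARGE_CODES = {"8", "9"}             # $10M and above


def asset_band_shares(rows: list[dict[str, Optional[str]]]) -> dict[str, float]:
    """Return the percentage of rows in the small, large, and other asset bands."""
    codes = Counter((r.get("ASSET_CD") or "").strip() or "0" for r in rows)
    total = sum(codes.values())
    if total == 0:
        return {"small": 0.0, "large": 0.0, "other": 0.0}
    small = sum(codes[c] for c in SMALL_CODES)
    large = sum(codes[c] for c in LARGE_CODES)
    other = total - small - large
    return {
        "small": 100.0 * small / total,
        "large": 100.0 * large / total,
        "other": 100.0 * other / total,
    }


# Invented sample: five small, one large, two in neither band (codes 6 and 0).
sample_rows = [{"ASSET_CD": c} for c in ["1", "2", "3", "4", "4", "6", "9", "0"]]
shares = asset_band_shares(sample_rows)
print(shares)
```

Passing the Human Services rows (p_rows in the script) to this function would give the headline percentages without reading the full ten-row table.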

The ProPublica top-charity output will typically show major hospital systems and health insurance organizations at the top: entities such as Kaiser Foundation Hospitals, UPMC, CommonSpirit Health, and Ascension Health report billions of dollars in annual Form 990 revenue. Universities appear in the next tier. This concentration reflects the dominance of nonprofit healthcare in the aggregate financial statistics of the 501(c)(3) sector: hospital systems collectively account for a larger share of 990 revenue than all other nonprofit subsectors combined.
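
A rough check on this concentration is possible from the BMF alone, by summing the revenue column per NTEE major group. The sketch below makes several assumptions: revenue_share_by_major is a hypothetical helper, the default column name F990_REV_AMT is taken from the script's column list and may differ from the header in an actual download, negative amounts are clamped to zero, and the sample rows are invented. Health (E) is also only a coarse proxy, since it includes more than hospital systems.

```python
from collections import defaultdict
from typing import Optional


def revenue_share_by_major(
    rows: list[dict[str, Optional[str]]],
    revenue_col: str = "F990_REV_AMT",  # assumed column name; pass the real header if it differs
) -> dict[str, float]:
    """Return each NTEE major group's percentage of total revenue, largest first."""
    totals: dict[str, float] = defaultdict(float)
    for row in rows:
        ntee = (row.get("NTEE_CD") or "").strip().upper()
        # Blank or non-alphabetic codes fall into 'Z', as in the script above.
        major = ntee[0] if ntee and "A" <= ntee[0] <= "Z" else "Z"
        try:
            amount = float(row.get(revenue_col) or 0)
        except ValueError:
            amount = 0.0
        # Clamp negatives so the shares stay between 0 and 100 percent.
        totals[major] += max(amount, 0.0)
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
    return {major: 100.0 * amount / grand_total for major, amount in ranked}


# Invented sample rows, not real BMF records.
sample_rows = [
    {"NTEE_CD": "E22", "F990_REV_AMT": "900"},
    {"NTEE_CD": "B43", "F990_REV_AMT": "60"},
    {"NTEE_CD": "P20", "F990_REV_AMT": "40"},
    {"NTEE_CD": "", "F990_REV_AMT": ""},
]
revenue_shares = revenue_share_by_major(sample_rows)
for major, pct in revenue_shares.items():
    print(f"{major}: {pct:.1f}%")
```

Because the BMF carries only summary amounts from the most recent return on file, figures produced this way are indicative rather than a substitute for the Form 990 data itself.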

Many large US nonprofits, including Catholic Relief Services, Save the Children, World Vision, and the International Rescue Committee, also appear in the federal foreign assistance database as implementing partners receiving USAID contracts and grants. For award-level data on obligation amounts, recipient countries, DAC sector codes, and year-over-year disbursement trends, see USAID Foreign Aid Data: The Federal Database Behind $40 Billion in Annual US Development Assistance (2026-12-04).

For the Social Security Administration's OASDI dataset, which covers 70 million beneficiaries and $1.4 trillion in annual benefit payments, see Social Security OASDI: The Federal Data Behind $1.4 Trillion in Annual Benefits and 70 Million Recipients. That article also covers the connection between nonprofit-sector employment and Form SSA-1099 Social Security contributions, including how nonprofit employees with 401(k)-equivalent 403(b) plans interact with Social Security benefit calculations through the Windfall Elimination Provision and Government Pension Offset.