IRS Statistics of Income: The Federal Dataset Behind the US Tax and Income Distribution
Since 1916, three years after the modern income tax took effect, the Internal Revenue Service has published aggregated statistics drawn from the tax returns it collects. That program, the Statistics of Income, is the definitive federal source on how income is distributed across the United States, how much of it is taxed, at what effective rates, and through which deductions and credits the tax burden is modified. No survey can match it: the SOI is grounded in administrative records covering essentially every individual and corporation required to file a federal return.
What the SOI Program Is
The Statistics of Income program, housed within the IRS Research, Applied Analytics, and Statistics division, compiles and publishes aggregated tabulations drawn from the full population of tax returns filed each year. Published at irs.gov/statistics, the SOI's flagship product is the Individual Income Tax Returns publication, drawn from Form 1040 filings. Companion publications cover estate and gift tax returns (Forms 706 and 709), corporate income tax returns (Form 1120), partnership returns (Form 1065), and nonfarm sole proprietorship returns (Schedule C).
The SOI does not release individual tax returns. Those are confidential under Internal Revenue Code Section 6103, which prohibits disclosure of return information except under specific statutory exceptions. What the SOI publishes are tabulations: aggregated counts, totals, and averages computed across groups of returns defined by income class, filing status, geographic area, or other characteristics. A separate microdata product — the Public Use File, discussed below — provides anonymized individual-return records for approved research purposes.
Publication follows the tax cycle with a roughly two-year lag. Data for tax year 2022 (returns filed in early 2023) typically appears in the SOI tables in 2024. The lag reflects the time required for the IRS to process returns, run audit cycles, and compile statistically defensible tabulations from the full return population.
Individual Income Tax Statistics: Structure of the Tables
The individual income tax tables are organized primarily by Adjusted Gross Income class. AGI, defined as gross income less above-the-line deductions such as student loan interest, alimony paid under divorce agreements executed before 2019, and contributions to certain retirement accounts, is the central organizing variable in the individual tax system. The SOI reports tabulations across approximately twenty AGI brackets: $1 under $5,000; $5,000 under $10,000; continuing in steps through $500,000 under $1 million; $1 million under $1.5 million; and so on to $10 million or more.
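The "X under Y" labels mean the lower bound is inclusive and the upper bound exclusive. A minimal sketch of that lookup follows; the list of lower bounds is an assumption reconstructed from recent SOI size classes and should be checked against the published table for the year in question, and `agi_class` is a hypothetical helper name.

```python
from bisect import bisect_right

# Assumed lower bounds of the AGI size classes (verify against the SOI table).
LOWER_BOUNDS = [
    1, 5_000, 10_000, 15_000, 20_000, 25_000, 30_000, 40_000, 50_000,
    75_000, 100_000, 200_000, 500_000, 1_000_000, 1_500_000,
    2_000_000, 5_000_000, 10_000_000,
]


def agi_class(agi):
    """Return the size-class label for an AGI amount (lower bound inclusive)."""
    if agi < LOWER_BOUNDS[0]:
        return "No adjusted gross income"
    i = bisect_right(LOWER_BOUNDS, agi) - 1  # index of the class containing agi
    lower = LOWER_BOUNDS[i]
    if i == len(LOWER_BOUNDS) - 1:
        return f"${lower:,} or more"
    return f"${lower:,} under ${LOWER_BOUNDS[i + 1]:,}"


print(agi_class(4_999.99))
print(agi_class(5_000))
print(agi_class(12_500_000))
```

Using `bisect_right` keeps an amount exactly equal to a boundary in the higher class, which matches the "under" wording.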
For each AGI class, the SOI tables report the number of returns, the aggregate amount of each income type and deduction category, and the aggregate tax liability. Income categories include wages and salaries, taxable interest, qualified dividends, net business income, net capital gains, IRA distributions, pension and annuity income, and Social Security benefits received. Deduction categories distinguish between standard and itemized deductions; itemized subtotals include state and local taxes paid, mortgage interest, charitable contributions, and casualty losses. Tax liability is shown before and after credits, with separate lines for the alternative minimum tax, the child tax credit, the earned income tax credit, education credits, and the net investment income tax (the 3.8% surtax on investment income for high earners).
State-level SOI data is published separately, providing the same income-type and tax-liability tabulations broken down by state of residence rather than national AGI class. The state data is used extensively for comparing tax bases across states and for estimating the distributional effects of state-level tax policy.
Income Concentration and the Top 1 Percent
The SOI is the primary data source underlying the income inequality research program of economists Emmanuel Saez and Gabriel Zucman at UC Berkeley, building on earlier work by Saez and Thomas Piketty. Their estimates of the income share of the top 1%, top 0.1%, and top 0.01% — which have become central reference points in policy debates about inequality since the early 2000s — are computed directly from the SOI tabulations combined with national income aggregates from the Bureau of Economic Analysis.
The SOI data shows the approximate AGI threshold for membership in each top-income group. In recent years, the top 1% cutoff has been roughly $700,000 in AGI; the top 0.1% cutoff approximately $3.3 million; and the top 0.01% cutoff approximately $15 million, though these thresholds shift with economic conditions and capital market performance. The top 1% earns approximately 20% of all AGI reported on individual returns and pays approximately 40% of all federal individual income tax — figures that reflect both the concentration of pre-tax income and the progressive structure of the rate schedule.
Capital gains are the most concentrated income source in the SOI data. The top 1% of filers by AGI typically receives around 70% of all capital gains reported on individual returns in a given year. Because realized capital gains are taxed at preferential rates — a maximum statutory rate of 20%, plus the 3.8% net investment income tax for high earners, compared to a 37% top rate on ordinary income — the composition of income at the top of the distribution has direct implications for both revenue estimates and debates about the progressivity of the overall system. The SOI provides the empirical foundation for that debate in a way that no survey could, because capital gains are frequently underreported in household surveys relative to tax records.
The Earned Income Tax Credit Distribution
At the other end of the income distribution, the SOI provides equally detailed data on the Earned Income Tax Credit. The EITC is the largest cash transfer program for working families in the federal budget: approximately 25 million returns claim the credit in a typical year, with total EITC amounts distributed approaching $65 billion annually. Because the EITC is refundable — claimants receive the full credit amount even if their income tax liability is zero — it functions as a net cash payment for most recipients.
The SOI EITC tables show credit amounts by number of qualifying children (zero, one, two, three or more) and by income level, documenting both the phase-in and the phase-out of the credit. The phase-in structure rewards additional earned income for the lowest earners; the phase-out creates an effective marginal tax rate on earned income above the point where the credit begins to shrink, and that rate can interact unfavorably with state income taxes and benefit phase-outs from other programs. The SOI data allows researchers and policymakers to trace where the phase-out range falls and how many returns are affected at each point in the income range. That analysis is difficult with survey data alone because low-income households are both undersampled and more likely to misreport income or credit amounts in household surveys.
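The phase-in and phase-out structure can be sketched as a simple trapezoid schedule. The parameters below are hypothetical round numbers chosen for illustration, not the statutory values for any year, and the sketch ignores filing status and the AGI test that the actual credit applies.

```python
# Hypothetical schedule parameters for illustration only; actual EITC
# parameters vary by tax year, filing status, and number of children.
PHASE_IN_RATE = 0.34
MAX_CREDIT = 4_000.0
PHASE_OUT_START = 20_000.0
PHASE_OUT_RATE = 0.16


def eitc(earned_income):
    """Credit under a simplified phase-in, plateau, phase-out schedule."""
    phased_in = min(PHASE_IN_RATE * earned_income, MAX_CREDIT)
    reduction = PHASE_OUT_RATE * max(0.0, earned_income - PHASE_OUT_START)
    return max(0.0, phased_in - reduction)


def implicit_marginal_rate(earned_income, step=100.0):
    """Credit lost per extra dollar earned; negative values are a subsidy."""
    return (eitc(earned_income) - eitc(earned_income + step)) / step


for income in (5_000, 15_000, 30_000, 50_000):
    print(income, round(eitc(income), 2), round(implicit_marginal_rate(income), 3))
```

In the phase-in range the implicit rate is negative (each extra dollar earns more credit); in the phase-out range it equals the phase-out rate, which stacks on top of any other taxes or benefit reductions the filer faces.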
High-Income Tax Returns
The SOI publishes a dedicated annual report titled High-Income Tax Returns that focuses specifically on returns with AGI of $200,000 or more and provides additional breakdowns at the $1 million and above threshold. This publication tracks income composition, effective tax rates, itemized deductions, and tax credit usage for the top of the distribution in granular detail unavailable in the main tabulations.
One of the most policy-relevant findings from the high-income publication is the gap between statutory and effective tax rates at the top. The top statutory marginal rate on ordinary income has been 37% since the Tax Cuts and Jobs Act of 2017 took effect in 2018. But the average effective total income tax rate for returns with AGI of $1 million or more has generally ranged between 25% and 30% over the period since 2010. That gap arises from several sources: the substantial share of high incomes that consists of long-term capital gains and qualified dividends taxed at preferential rates; itemized deductions for charitable contributions, mortgage interest, and (before the SALT cap introduced by TCJA) state and local taxes; and income that is excluded from AGI through retirement account contributions and other above-the-line adjustments.
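A rough worked example shows how income composition and deductions pull the effective rate below 37%. It is a deliberately crude sketch: the income mix and deduction amount are hypothetical, and applying the top rates flat to every dollar ignores the lower brackets, so it overstates the tax on ordinary income.

```python
ORDINARY_TOP_RATE = 0.37       # top statutory rate on ordinary income
PREFERENTIAL_TOP_RATE = 0.238  # 20% capital gains rate plus 3.8% NIIT


def blended_effective_rate(ordinary, preferential, itemized_deductions):
    """Tax as a share of AGI, with top rates applied flat (a simplification)."""
    taxable_ordinary = max(0.0, ordinary - itemized_deductions)
    tax = (taxable_ordinary * ORDINARY_TOP_RATE
           + preferential * PREFERENTIAL_TOP_RATE)
    return tax / (ordinary + preferential)


# Hypothetical return: $600k ordinary income, $400k long-term gains and
# qualified dividends, $100k of itemized deductions.
rate = blended_effective_rate(600_000, 400_000, 100_000)
print(f"{rate:.1%}")
```

Even with the flat-rate simplification, shifting 40% of income into the preferential category and deducting 10% of AGI is enough to bring the rate in this example well below the statutory top rate.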
The effective rate data from the SOI is used in scoring proposed tax legislation. When Treasury or the Joint Committee on Taxation estimates the revenue effect of a proposed change to capital gains rates or the SALT deduction cap, the distribution of income, deductions, and current effective rates observable in the SOI tables provides the baseline from which the scoring model operates.
Estate Tax Statistics
The estate tax SOI publication, drawn from Form 706 filings, provides annual data on the gross estate composition, deductions, and federal estate tax paid by taxable estates. The estate tax applies only to estates above the applicable exemption amount — $13.6 million per decedent in 2024 under current law — so the number of taxable returns is small, typically a few thousand per year, but the economic values involved are large.
The SOI estate tax tables break gross estate value into asset categories: publicly traded stock, state and local bonds, federal bonds, other bonds, cash and deposits, real estate, closely held business interests, retirement assets, and life insurance. This composition data reveals the structure of wealth at the very top of the distribution in a way that no survey can, because wealthy households are systematically underrepresented in household wealth surveys and more likely to misreport when they do respond.
A recurring finding in the estate tax data is that the federal estate tax raises substantially less revenue than a mechanical application of the statutory rate to gross estates above the exemption would imply: typically on the order of $20 billion per year despite hundreds of billions in gross estate value above the exemption. Several factors explain the gap: the marital deduction (assets passing to a surviving spouse are not taxed at the first death); charitable bequests (deducted from the taxable estate); valuation discounts applied to closely held business interests and real estate; and the use of trusts and other planning vehicles that reduce the taxable estate. A related feature, though not itself a reduction in estate tax, is the stepped-up basis rule: the income tax basis of inherited assets resets to fair market value at death, so appreciation that was never subject to income tax during the decedent's lifetime escapes income tax permanently.
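The gap between the mechanical calculation and the tax actually due can be illustrated with a single hypothetical estate. This is a sketch under stated assumptions: a flat 40% rate above the exemption stands in for the full rate schedule and unified credit, the exemption is the $13.6 million figure cited above, and the estate values are invented.

```python
TOP_RATE = 0.40          # flat rate applied above the exemption (simplification)
EXEMPTION = 13_600_000   # 2024 exemption figure cited in the text


def mechanical_tax(gross_estate):
    """Statutory rate applied to everything above the exemption."""
    return TOP_RATE * max(0, gross_estate - EXEMPTION)


def tax_after_deductions(gross_estate, marital=0, charitable=0, discounts=0):
    """Same calculation after the main reductions to the taxable estate."""
    taxable_estate = gross_estate - marital - charitable - discounts
    return TOP_RATE * max(0, taxable_estate - EXEMPTION)


# Hypothetical estate: $50M gross, $20M to a surviving spouse, $5M to charity.
print(mechanical_tax(50_000_000))
print(tax_after_deductions(50_000_000, marital=20_000_000, charitable=5_000_000))
```

In this example the marital and charitable deductions alone cut the liability to under a third of the mechanical figure, before any valuation discounts or trust planning.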
Corporate Income Tax Statistics
The corporate SOI, drawn from Form 1120 filings, reports annual aggregate statistics on C-corporations: total receipts, total assets, net income (and net deficit), income tax before credits, and the major tax credit categories including the research and development credit, the foreign tax credit, and general business credits. The data is reported by industrial sector and, for some tables, by asset size class.
The corporate SOI is the primary data source for measuring corporate effective tax rates over time. The Tax Cuts and Jobs Act of 2017 reduced the statutory corporate income tax rate from 35% to 21%. But the effective rate (corporate income taxes actually paid as a share of pre-tax book income or taxable income) was already well below the statutory rate before TCJA and remained so after. Accelerated depreciation allowances (including bonus depreciation, which TCJA raised to 100%), the R&D credit, the foreign-derived intangible income deduction, and the ability to defer recognition of foreign profits combine to create effective rates for large corporations that differ substantially from the statutory rate. The SOI does not separately identify which corporations are taking which credits at what scale, but in aggregate it documents the wedge between statutory and effective rates that motivates ongoing policy debate about the corporate minimum tax.
The IRS Public Use File
In addition to the published tabulations, the IRS releases a de-identified microdata sample of individual income tax returns under the name Statistics of Income Public Use File. The PUF contains approximately 200,000 anonymized records drawn from a stratified random sample of the full return population, with oversampling of high-income returns to ensure statistical reliability at the upper end of the distribution. Each record includes the major income components, deduction categories, tax credits, filing status, and number of exemptions present on the underlying return, with dollar amounts top-coded and geographic identifiers suppressed or grouped to prevent re-identification.
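Because high-income returns are oversampled, each record in a stratified sample has to carry a sampling weight, and population estimates come from weighted sums rather than simple averages. The sketch below uses invented records and hypothetical field names (`agi`, `weight`); it is not the PUF's actual layout.

```python
# Invented records: each weight is the number of population returns
# that one sampled record represents. High-income strata get small weights
# because a larger fraction of those returns is sampled.
records = [
    {"agi": 30_000, "weight": 900.0},
    {"agi": 80_000, "weight": 700.0},
    {"agi": 2_000_000, "weight": 10.0},
    {"agi": 5_000_000, "weight": 5.0},
]


def weighted_total(rows, field):
    """Estimated population total of `field`."""
    return sum(r[field] * r["weight"] for r in rows)


def weighted_mean(rows, field):
    """Estimated population mean of `field`."""
    return weighted_total(rows, field) / sum(r["weight"] for r in rows)


unweighted_mean = sum(r["agi"] for r in records) / len(records)
print(weighted_total(records, "agi"))
print(round(weighted_mean(records, "agi"), 2), unweighted_mean)
```

The unweighted mean is dominated by the oversampled high-income records; applying the weights restores each stratum to its share of the population.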
The PUF is the standard input for microsimulation models of tax policy. Organizations including the Tax Policy Center, the Tax Foundation, and various academic groups maintain microsimulation models that take PUF records as the baseline population, apply proposed statutory changes to each record, and aggregate the results to estimate revenue and distributional effects. The PUF is available for purchase through the University of Michigan's Institute for Social Research and directly from the IRS Statistics of Income division. Access requires an application describing the research purpose; the data is licensed for research use and may not be redistributed.
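The record-by-record logic of a microsimulation can be sketched in a few lines: compute each record's tax under current law and under the proposal, then sum the weighted differences. Everything here is a toy assumption (the records, the field names, and a single flat capital gains rate), and the estimate is static, with no behavioral response such as changes in the timing of realizations.

```python
# Invented weighted records; `cap_gains` and `weight` are hypothetical fields.
records = [
    {"cap_gains": 0, "weight": 1_000.0},
    {"cap_gains": 10_000, "weight": 200.0},
    {"cap_gains": 1_000_000, "weight": 5.0},
]


def cap_gains_tax(record, rate):
    """Toy tax calculator: a flat rate on reported capital gains."""
    return record["cap_gains"] * rate


def static_revenue_effect(rows, baseline_rate, reform_rate):
    """Weighted sum of per-record tax changes, holding behavior fixed."""
    return sum(
        r["weight"] * (cap_gains_tax(r, reform_rate) - cap_gains_tax(r, baseline_rate))
        for r in rows
    )


print(static_revenue_effect(records, baseline_rate=0.20, reform_rate=0.25))
```

Production models replace the toy calculator with a full implementation of the tax code and typically age the records forward to the budget window, but the aggregation step has this same shape.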
Researchers requiring access to the full return population — not the PUF sample — for analysis that cannot be conducted on tabulations or the PUF can apply for access to the IRS data enclave in Washington, D.C., under the provisions of Internal Revenue Code Section 6103(j), which permits Treasury and IRS to share return information with specific statistical agencies for research and statistical purposes. Access at this level is exceptional and subject to strict data security requirements.
Python: Parsing SOI AGI Tables to Compute Income and Tax Shares
The IRS posts the individual complete report as an Excel workbook at a predictable URL on irs.gov. The following script downloads that workbook, parses the AGI class table, and computes the share of total AGI and total income tax attributable to each bracket, along with the effective income tax rate for each bracket. This replicates the core calculation underlying published analyses of income concentration and tax progressivity.
```python
import io

import pandas as pd
import requests

# IRS SOI Individual Complete Report -- Table 1 (All Returns)
# The IRS posts Excel files at a stable URL pattern; adjust the year as new data is released.
# 2022 tax-year data (most recent as of this writing) is published ~2024.
YEAR = "22"  # last two digits of tax year
SOI_URL = f"https://www.irs.gov/pub/irs-soi/{YEAR}in01an.xlsx"

resp = requests.get(SOI_URL, timeout=120)
resp.raise_for_status()

# The Excel file has multiple header rows; row index 0 is a merged title,
# row index 1 is the column header. Skip the first row.
# (Reading .xlsx files with pandas requires the openpyxl package.)
raw = pd.read_excel(
    io.BytesIO(resp.content),
    sheet_name=0,
    header=1,
    dtype=str,
)

# Drop rows that are entirely NaN (spacer rows in the IRS layout)
raw = raw.dropna(how="all").reset_index(drop=True)

# The first column is the AGI class label; rename for clarity
raw = raw.rename(columns={raw.columns[0]: "agi_class"})

# IRS tables repeat "Total" rows -- keep only AGI-class detail rows,
# i.e. labels containing a digit (ranges such as "$1 under $5,000")
detail = raw[raw["agi_class"].str.contains(r"\d", na=False)].copy()


# Identify the three key columns by partial name match
# (exact column names vary slightly across years)
def find_col(df, fragment):
    """Return the first column whose name contains `fragment` (case-insensitive)."""
    matches = [c for c in df.columns if fragment.lower() in str(c).lower()]
    if not matches:
        raise KeyError(f"No column matching {fragment!r}; check the header row")
    return matches[0]


num_returns_col = find_col(detail, "Number of returns")
agi_col = find_col(detail, "Adjusted gross income")
tax_col = find_col(detail, "Total income tax")

# Values were read as strings; convert to numbers, turning footnote
# markers and other non-numeric cells into NaN
for col in [num_returns_col, agi_col, tax_col]:
    detail[col] = pd.to_numeric(detail[col], errors="coerce")
detail = detail.dropna(subset=[num_returns_col, agi_col, tax_col])

# Compute shares
total_agi = detail[agi_col].sum()
total_tax = detail[tax_col].sum()
detail["agi_share_pct"] = (detail[agi_col] / total_agi * 100).round(2)
detail["tax_share_pct"] = (detail[tax_col] / total_tax * 100).round(2)

# Effective tax rate for each AGI class
detail["effective_rate_pct"] = (detail[tax_col] / detail[agi_col] * 100).round(2)

print(
    detail[["agi_class", "agi_share_pct", "tax_share_pct", "effective_rate_pct"]]
    .to_string(index=False)
)
```
Several implementation notes. The IRS Excel files use multi-row headers with merged cells; skipping the first row and using the second as the column header avoids the merged-cell problem in most years, but exact column names shift slightly across publications. The find_col helper performs a case-insensitive substring match so the script remains functional when the IRS adjusts column labels. Dollar amounts in the SOI tables are reported in thousands, so raw values must be multiplied by 1,000 before being compared to external sources that report full dollar figures. The URL pattern shown — in01an.xlsx for Table 1, All Returns — follows the IRS naming convention for recent years; older years used slightly different naming conventions and may require adjustment.
Extending this analysis to plot cumulative income and tax share curves (effectively Lorenz curves for pre-tax income and tax burdens) requires computing the cumulative sum of AGI share and tax share from the lowest to the highest bracket and plotting against the cumulative share of returns. The effective rate column already produced by the script shows the rate-at-each-bracket profile that demonstrates the progressive structure of the rate schedule: effective rates rise from near zero at the bottom to the mid-20s percentage range at the top. Negative rates for EITC-eligible filers appear only once refundable credits are netted against tax, which the total income tax column used here does not do.
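The cumulative-share step can be sketched independently of the workbook. The bracket figures below are invented for illustration and are ordered from the lowest to the highest AGI class; `cumulative_shares` is a hypothetical helper name.

```python
from itertools import accumulate


def cumulative_shares(values):
    """Running total of `values` expressed as a fraction of the grand total."""
    total = sum(values)
    return [running / total for running in accumulate(values)]


# Invented bracket data, lowest AGI class first.
returns = [50, 30, 15, 5]   # number of returns
agi = [10, 25, 30, 35]      # aggregate AGI
tax = [0, 10, 30, 60]       # aggregate income tax

cum_returns = cumulative_shares(returns)
cum_agi = cumulative_shares(agi)
cum_tax = cumulative_shares(tax)

# Each (x, y) pair is one point on the corresponding concentration curve.
for x, y_agi, y_tax in zip(cum_returns, cum_agi, cum_tax):
    print(f"{x:.2f}  {y_agi:.2f}  {y_tax:.2f}")
```

Plotting `cum_agi` and `cum_tax` against `cum_returns` gives the two curves; under a progressive system the tax curve lies below the income curve, since the lower brackets bear a smaller share of tax than of income.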
Connecting SOI to Other Federal Data
The SOI fits into a broader ecosystem of federal economic data in several ways. The BEA National Income and Product Accounts provide total wages and salaries, corporate profits, and proprietors' income for the macro economy; the SOI provides the distributional detail within those aggregates. The BEA total for wages and salaries will differ from the SOI aggregate because BEA covers all workers while the SOI covers only those required to file federal returns, but the two can be reconciled as a consistency check on income estimates.
The SOI corporate data complements SEC financial reporting for publicly traded companies. The SOI covers all C-corporations including privately held ones — the majority of corporate filers by count, though not by asset value — whereas SEC EDGAR financial statements cover only reporting companies. For aggregate effective rate analysis, the SOI is more comprehensive; for firm-level or industry-specific analysis of large public companies, SEC filings provide detail the SOI cannot.
The partnership and sole proprietorship SOI publications close a major gap in federal business statistics. The Census Bureau's economic censuses and County Business Patterns cover establishments rather than tax units and have limited income data. The IRS partnership SOI tracks the total income, deductions, and capital accounts of partnerships and LLCs taxed as partnerships; S-corporations, which file Form 1120-S, are covered by separate SOI tabulations. Together these pass-through forms have displaced C-corporations for most small and medium-sized businesses since the 1980s. Understanding the full business income distribution requires reading the SOI individual and business publications together, since pass-through income appears on individual returns in the SOI individual tables rather than the corporate ones.
Limitations and Practical Considerations
The two-year publication lag is the most significant operational limitation of the SOI for policy-relevant analysis. Tax year data becomes available approximately two years after the tax year closes, meaning that analysis of economic conditions requires working with data that reflects circumstances two to three years prior. For questions about the current distribution of income or tax burdens, the SOI provides a baseline that must be combined with more current but less precise sources.
The SOI tabulations measure reported income, not economic income. Capital gains are reported when realized — when an asset is sold — rather than when they accrue. In years with depressed equity markets, reported capital gains and the measured income share of the top 1% fall substantially even if unrealized wealth concentration has continued to grow. This distinction between income flows and wealth stocks is a recurring source of interpretive difficulty in analyses based on SOI data.
AGI is a tax concept, not an economic income concept. It excludes employer contributions to retirement plans, the employer share of payroll taxes, the imputed rental value of owner-occupied housing, and certain transfer payments, among other items. Researchers comparing SOI-based income shares to alternative measures should be precise about which income definition is being used and how the differences affect the comparison.
Finally, the SOI counts tax filing units, not individuals. A married couple filing jointly constitutes one return regardless of how income is divided within the household. The number of returns is therefore not directly comparable to the number of adults or households in census data without adjustments for filing status and household composition.
The BEA National Income and Product Accounts provide the macroeconomic income aggregates that the SOI distributional data disaggregates — the two programs are complementary lenses on the same underlying income flows. See BEA GDP Accounts: National Income, Output, and the NIPA Framework.
The IRS Form 990 public disclosure program provides a parallel view of the tax system for nonprofit organizations — entities exempt from income tax whose financial disclosures illuminate the third sector that the SOI individual and corporate tables exclude. See IRS Form 990: The Federal Nonprofit Financial Disclosure Dataset.
For corporate income and financial performance data on publicly traded companies — the segment of the corporate sector where SEC filings provide granularity unavailable in the SOI aggregate tables — see SEC EDGAR Financial Statements: Public Company Income and Balance Sheet Data.