
HMDA: The Home Mortgage Disclosure Act Dataset Behind Every Redlining Investigation

· AI Analytics

Every mortgage application submitted to a covered lender in the United States gets recorded in a federal dataset that has been accumulating since 1975. The Home Mortgage Disclosure Act Loan Application Register — HMDA LAR — is the primary evidence base for redlining enforcement, fair-lending litigation, Community Reinvestment Act examinations, and academic research on racial wealth gaps. No other public dataset captures mortgage outcomes, borrower demographics, loan pricing, and property geography at the scale HMDA does: roughly nine million records per year, covering approximately 80–85% of all US mortgage applications.

What HMDA Is and Why It Exists

Congress enacted the Home Mortgage Disclosure Act in 1975 after years of community advocacy documenting that banks were systematically refusing to lend in minority neighborhoods — a practice called redlining, named after the literal red lines that Home Owners' Loan Corporation appraisers drew on maps in the 1930s to mark neighborhoods they deemed too risky for federally backed mortgage insurance. HMDA created a mandatory disclosure regime: most mortgage lenders must publicly report every loan application, origination, and denial, along with the borrower's race, ethnicity, sex, and income, the property's location, and the loan's terms.

The Federal Reserve Board originally administered HMDA. The Dodd-Frank Wall Street Reform and Consumer Protection Act of 2010 transferred rulemaking and supervisory authority to the newly created Consumer Financial Protection Bureau. The CFPB issued a substantially revised HMDA Final Rule in 2015 that took effect for data collected beginning in 2018. That rule expanded the required fields significantly — adding credit scores, debt-to-income ratios, loan-to-value ratios, automated underwriting system results, and pricing data — making post-2018 HMDA records among the richest public mortgage datasets in the world.

The filing threshold for HMDA coverage is institution-specific and has changed over time. Under the current closed-end threshold, an institution that originated fewer than 25 closed-end mortgage loans in either of the two preceding calendar years is exempt from closed-end reporting. (The CFPB's 2020 rule raised that threshold to 100 loans, but a federal district court vacated the increase in 2022, restoring the 25-loan figure.) Non-depository lenders face slightly different coverage criteria. The exemptions mean that very small community lenders fall outside HMDA, but nearly every major bank, credit union, and mortgage company is covered, and coverage of total mortgage volume is high.

What Is in a HMDA Record

The post-2018 HMDA Loan Application Register is a flat file with over 100 fields per record. The key fields, organized by category, are:

Institution identifiers. Each record carries the reporting institution's Legal Entity Identifier (LEI), its FDIC certificate number or RSSD ID where applicable, and the calendar year of the activity. These allow the record to be linked to call report data, CRA performance evaluations, and other bank regulatory filings.

Loan characteristics. Loan type is coded as conventional (1), FHA-insured (2), VA-guaranteed (3), or USDA Rural Housing Service (4). Loan purpose distinguishes home purchase (1), home improvement (2), refinancing (31), cash-out refinancing (32), and other purpose (4). Construction method separates site-built properties from manufactured homes. Occupancy type classifies the property as a principal residence (1), second home (2), or investment property (3).

Action taken. The action taken field is the outcome variable for fair lending analysis. Code 1 means the loan was originated; code 2 means it was approved but not accepted by the applicant; code 3 means it was denied by the financial institution; code 4 means the applicant withdrew; code 5 means the file was closed for incompleteness; code 6 means the institution purchased a previously originated loan; codes 7 and 8 cover preapproval requests that were denied or approved but not accepted. For most disparity analysis, the comparison is between originated (1) and denied (3), excluding withdrawn applications and preapproval requests.
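The filtering rule described above can be sketched in a few lines of pandas. This is a minimal illustration, assuming a DataFrame with an integer `action_taken` column; the sample rows and the helper name `disparity_sample` are made up for the example.

```python
import pandas as pd

# Action-taken codes as listed in the paragraph above.
ACTION_LABELS = {
    1: "originated",
    2: "approved, not accepted",
    3: "denied",
    4: "withdrawn",
    5: "closed for incompleteness",
    6: "purchased loan",
    7: "preapproval denied",
    8: "preapproval approved, not accepted",
}

def disparity_sample(lar: pd.DataFrame) -> pd.DataFrame:
    """Keep only originated (1) and denied (3) records and add a 0/1 denial flag."""
    out = lar[lar["action_taken"].isin([1, 3])].copy()
    out["denied"] = (out["action_taken"] == 3).astype(int)
    out["action_label"] = out["action_taken"].map(ACTION_LABELS)
    return out

# Made-up records, purely for illustration.
lar = pd.DataFrame({"action_taken": [1, 3, 4, 1, 6, 3, 5, 1]})
sample = disparity_sample(lar)
denial_rate = sample["denied"].mean()
print(len(sample), round(denial_rate, 3))
```

Dropping codes 2 and 4 through 8 before computing the rate keeps the denominator limited to applications on which the lender made a final credit decision that ended in either a loan or a denial.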

Property location. HMDA captures the state, county (5-digit FIPS), census tract, and metropolitan statistical area (MSA) of the property. Census tract is the geographic unit of analysis for redlining investigations. The property address is not disclosed, but the census tract is specific enough to map lending patterns at the neighborhood level.
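Because tract-level joins depend on exact identifiers, it helps to treat the tract code as a string rather than a number. The sketch below assumes the tract is stored as the standard 11-digit Census identifier (2-digit state, 3-digit county, 6-digit tract); the function name and the sample value are illustrative.

```python
def split_tract_geoid(geoid: str) -> dict:
    """Split an 11-digit census tract identifier into its FIPS components."""
    geoid = str(geoid).strip()
    if len(geoid) != 11 or not geoid.isdigit():
        raise ValueError(f"expected an 11-digit tract identifier, got {geoid!r}")
    return {
        "state_fips": geoid[:2],    # 2-digit state
        "county_fips": geoid[:5],   # state + 3-digit county = 5-digit county FIPS
        "tract_code": geoid[5:],    # 6-digit tract within the county
    }

# Illustrative identifier for a tract in county 17031.
parts = split_tract_geoid("17031839100")
print(parts)
```

Keeping the value as text preserves leading zeros, which matter for states whose FIPS codes start with 0.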

Applicant demographics. Race is reported using five OMB racial categories — American Indian or Alaska Native, Asian, Black or African American, Native Hawaiian or Other Pacific Islander, White — with two-character subcategories for Asian and Pacific Islander origins. Ethnicity is reported separately as Hispanic or Latino versus not Hispanic or Latino. Sex is reported as male, female, or not provided. The post-2018 rule added age ranges for both the applicant and co-applicant. The “derived race” and “derived ethnicity” fields aggregate the raw race/ethnicity codes into a single clean classification for analysis. Income is the applicant-stated gross annual income in thousands of dollars, as submitted on the loan application form; it is not verified income.

Credit and underwriting fields (post-2018). The expanded HMDA rule added the credit score and the scoring model used (e.g., FICO, VantageScore), though the public file discloses only the model type and withholds the score itself to limit re-identification risk. It also added the combined loan-to-value ratio (CLTV), the debt-to-income ratio (DTI), and the automated underwriting system (AUS) used: Desktop Underwriter (Fannie Mae), Loan Prospector / Loan Product Advisor (Freddie Mac), or proprietary systems. These fields substantially improve the ability to assess whether a denial was driven by legitimate underwriting criteria or by discriminatory treatment.

Pricing data. The rate spread field captures the difference in percentage points between the loan's APR and the Average Prime Offer Rate (APOR) for a comparable loan. Before 2018, lenders reported rate spread only when it met or exceeded the higher-priced thresholds of 150 basis points for first liens or 350 basis points for subordinate liens; under the 2018 rule it is reported for most covered originations regardless of size. The HOEPA status field flags loans that meet the high-cost loan definition under the Home Ownership and Equity Protection Act. Together these fields identify higher-priced lending patterns that may indicate predatory pricing or steering.
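The rate spread arithmetic is simple enough to show directly. The sketch below uses the 1.5 and 3.5 percentage point cut-offs quoted above and treats a spread at or above the cut-off as higher-priced; the APR and APOR values and the function names are invented for illustration.

```python
def rate_spread(apr: float, apor: float) -> float:
    """Rate spread in percentage points: loan APR minus the comparable APOR."""
    return round(apr - apor, 3)

def is_higher_priced(spread: float, lien_status: str) -> bool:
    """Apply the first-lien (1.5) or subordinate-lien (3.5) cut-off."""
    threshold = 1.5 if lien_status == "first" else 3.5
    return spread >= threshold

# Invented example values.
spread = rate_spread(apr=7.25, apor=6.50)
print(spread, is_higher_priced(spread, "first"))
```

A first-lien loan with a 2.0 point spread would be flagged, while the same spread on a subordinate lien would not.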

Secondary market purchaser. The purchaser type field records who purchased the loan on the secondary market after origination: Fannie Mae (1), Ginnie Mae (2), Freddie Mac (3), Farmer Mac (4), private securitizer (5), commercial bank or savings institution (6), or another type of purchaser (higher codes), with 0 for loans not sold during the reporting year. This data reveals whether lenders in specific geographies are originating for portfolio or for the GSEs, which has implications for credit access.

Scale and Coverage

HMDA is one of the largest publicly available loan-level datasets in the United States. The 2022 data release, published by the CFPB in August 2023, contained approximately 9.4 million loan/application records from roughly 5,100 reporting institutions. The dataset covers between 80% and 85% of all US mortgage applications in most years; the excluded share is concentrated in small community banks and credit unions that fall below the reporting threshold.

The CFPB publishes two versions of the annual dataset. The full HMDA LAR is available to researchers who register through the FFIEC's HMDA Platform and agree to data use terms. The modified LAR — the public version — has certain fields suppressed or modified to reduce re-identification risk, including replacing specific age values with ranges, suppressing census tract information in counties with fewer than 1,000 applications, and removing exact loan amounts above $1 million. For most fair-lending and geographic analyses, the modified LAR is sufficient. The CFPB typically publishes the prior year's data in August.

Redlining Investigations: How HMDA Is Used in Practice

The CFPB and the Department of Justice use HMDA as the primary quantitative evidence base for modern redlining enforcement. The investigative methodology follows a consistent framework. Examiners map application and origination rates by census tract, overlaid against census-tract racial composition data from the American Community Survey. They identify whether minority-majority census tracts within a lender's Community Reinvestment Act assessment area have substantially lower application rates than comparable majority-White tracts, controlling for income and housing characteristics. Geographic clustering of low application rates in minority neighborhoods, combined with evidence that the lender concentrated its marketing and branching in majority-White areas, supports a redlining finding.
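A first-pass version of the tract comparison can be sketched as follows. It assumes tract-level application counts for one lender have already been joined to ACS minority shares and a housing-unit denominator; every number below is invented, and the per-1,000-housing-units rate is one reasonable normalization, not necessarily the one examiners use.

```python
import pandas as pd

# Invented tract-level data for a single lender's assessment area.
tracts = pd.DataFrame({
    "tract": ["A", "B", "C", "D"],
    "minority_share": [0.82, 0.65, 0.20, 0.10],   # e.g. from the ACS
    "housing_units": [2000, 1500, 2500, 1000],
    "lender_apps": [2, 3, 60, 35],
})

# Label each tract by whether more than half of its residents are minority.
tracts["tract_group"] = (tracts["minority_share"] > 0.5).map(
    {True: "majority-minority", False: "majority-White"}
)

# Applications per 1,000 housing units within each tract group.
grouped = tracts.groupby("tract_group")[["lender_apps", "housing_units"]].sum()
rates = grouped["lender_apps"] / grouped["housing_units"] * 1000
rate_ratio = rates.loc["majority-minority"] / rates.loc["majority-White"]
print(rates.round(2).to_dict(), round(rate_ratio, 3))
```

A ratio far below 1.0 is only a screening signal; the income and housing controls mentioned above still have to be applied before drawing any conclusion.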

The enforcement record illustrates the stakes. In 2021, the DOJ reached a $5 million settlement with Trustmark National Bank over redlining in the Memphis MSA, the first major redlining settlement since the 2015 HMDA rule expansion. In 2023, the DOJ secured an $8.5 million settlement with Cadence Bank over redlining in Houston and Birmingham. Most significantly, the 2023 settlement with Los Angeles-based City National Bank for $31 million, the largest redlining settlement in DOJ history, relied substantially on HMDA data showing that the bank received virtually no applications from Black and Hispanic borrowers in areas where it operated.

Academic research complements the enforcement record. Multiple studies have documented that census tracts that received “hazardous” or “definitely declining” grades from the Home Owners' Loan Corporation in the 1930s still exhibit statistically lower homeownership rates, lower home values, and lower mortgage origination rates in modern HMDA data. The persistence of the spatial pattern across eight decades suggests that historical redlining locked in disinvestment cycles that contemporary lending patterns have not reversed.

Denial Reasons and Differential Treatment

HMDA records include up to four denial reason codes for denied applications. The standard codes are: debt-to-income ratio (1), employment history (2), credit history (3), collateral / property value (4), insufficient cash for down payment or closing costs (5), unverifiable information (6), incomplete application (7), mortgage insurance denied (8), and other (9). Lenders are not required to provide denial reasons for all application types, and some lenders leave the field blank even when required to complete it.

Analyzing denial reasons by applicant race reveals patterns that are central to fair-lending litigation. If Black applicants are denied at a higher rate on DTI grounds than White applicants with similar income levels, that disparity suggests either that the DTI threshold is applied inconsistently or that income verification procedures differ by race. If denial reasons differ in ways that cannot be explained by the credit and underwriting fields added in 2018, the residual disparity is consistent with disparate treatment rather than disparate impact from neutral policies, although it does not prove it on its own.
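The denial-reason comparison can be sketched as a cross-tabulation. The example assumes a DataFrame of denied applications with a `derived_race` column and a first denial reason column named `denial_reason-1` (the column name is an assumption about the file layout); the rows are invented.

```python
import pandas as pd

# Invented denied applications, four per group.
denied = pd.DataFrame({
    "derived_race": ["Black or African American"] * 4 + ["White"] * 4,
    "denial_reason-1": [1, 1, 3, 1, 3, 4, 1, 3],
})

# Subset of the denial reason codes listed above.
REASONS = {1: "debt-to-income", 3: "credit history", 4: "collateral"}
denied["reason"] = denied["denial_reason-1"].map(REASONS)

# Share of each group's denials attributed to each reason (rows sum to 1).
shares = pd.crosstab(denied["derived_race"], denied["reason"], normalize="index")
print(shares.round(2))
```

Normalizing within each row shows how the mix of stated reasons differs between groups, independent of how many denials each group has.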

One important caveat: denial reason codes are lender-reported. Lenders who are engaged in discriminatory practices have an obvious incentive to record facially neutral denial reasons. The post-2018 addition of DTI, CLTV, and credit score fields allows examiners to cross-check whether the stated denial reason is consistent with the underwriting data — a denied applicant with a 680 credit score and 35% DTI who is listed as denied for “credit history” is anomalous in a way that the pre-2018 data could not detect.
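The cross-check described above can be sketched as a simple consistency filter. It assumes access to numeric credit score and DTI values under the hypothetical column names `credit_score` and `dti`; the 660 and 43 cut-offs are illustrative placeholders, not regulatory or lender standards.

```python
import pandas as pd

# Illustrative cut-offs only; real underwriting standards vary by lender and product.
MIN_SCORE_OK = 660
MAX_DTI_OK = 43.0

def flag_inconsistent_denials(df: pd.DataFrame) -> pd.DataFrame:
    """Return denials whose stated reason looks at odds with the underwriting data."""
    # Denied for credit history (3) despite a score at or above the cut-off.
    credit_flag = (df["denial_reason"] == 3) & (df["credit_score"] >= MIN_SCORE_OK)
    # Denied for DTI (1) despite a ratio at or below the cut-off.
    dti_flag = (df["denial_reason"] == 1) & (df["dti"] <= MAX_DTI_OK)
    return df[credit_flag | dti_flag]

# Invented denied applications.
denials = pd.DataFrame({
    "app_id": ["a", "b", "c", "d"],
    "denial_reason": [3, 3, 1, 1],
    "credit_score": [680, 590, 720, 700],
    "dti": [35.0, 48.0, 55.0, 30.0],
})
flagged = flag_inconsistent_denials(denials)
print(flagged["app_id"].tolist())
```

Flagged records are candidates for file review, not findings; a stated reason can be legitimate for factors the two fields do not capture.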

The HMDA Explorer and CFPB API

The CFPB hosts an interactive HMDA Explorer at ffiec.cfpb.gov/hmda that allows users to filter records by year, institution, geography, loan type, action taken, and applicant characteristics, and to download the results as CSV. The underlying API is the HMDA Platform Data Browser API, whose base URL is https://ffiec.cfpb.gov/v2/data-browser-api/view/csv. It accepts query parameters for years, states (two-letter abbreviations), counties (5-digit FIPS), metropolitan statistical areas, LEIs, action taken codes, loan types, loan purposes, and applicant race. The API returns a flat CSV of all matching records, making it straightforward to pull a targeted slice of the national dataset without downloading the multi-gigabyte full LAR file.
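Building the query string programmatically avoids hand-concatenation mistakes. The sketch below only constructs the URL and does not call the API; the helper name `build_hmda_url` is hypothetical, and the state is passed as a two-letter abbreviation, which is my understanding of what the API expects and should be verified against the API documentation.

```python
from urllib.parse import urlencode

BASE_URL = "https://ffiec.cfpb.gov/v2/data-browser-api/view/csv"

def build_hmda_url(years, states, **filters) -> str:
    """Assemble a Data Browser CSV URL from lists of filter values."""
    params = {
        "years": ",".join(str(y) for y in years),
        "states": ",".join(states),
    }
    # Each extra filter (e.g. actions_taken=[1, 3]) becomes a comma-separated value.
    for name, values in filters.items():
        params[name] = ",".join(str(v) for v in values)
    # safe="," keeps commas readable instead of percent-encoding them.
    return BASE_URL + "?" + urlencode(params, safe=",")

url = build_hmda_url(
    [2023], ["IL"], actions_taken=[1, 3], loan_purposes=[1], loan_types=[1]
)
print(url)
```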

For bulk work (national denial rate analysis, cross-MSA comparisons, multi-year trends), the full modified LAR snapshot is the better access method. The CFPB publishes it as a single compressed, pipe-delimited flat file. The 2022 snapshot is approximately 6 GB uncompressed. A full national analysis typically requires reading it in chunks or filtering by state on download rather than loading the entire file into memory.
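Chunked reading can be sketched with the `chunksize` argument of `pandas.read_csv`. The example assumes a pipe-delimited file with `state_code` and `action_taken` columns and uses a tiny in-memory stand-in for the real snapshot; the function name is illustrative.

```python
import io
import pandas as pd

def count_by_state(source, chunksize: int = 100_000) -> dict:
    """Count originated/denied records per state without loading the whole file."""
    totals = {}
    reader = pd.read_csv(
        source,
        sep="|",
        dtype=str,                                  # keep codes as text
        usecols=["state_code", "action_taken"],     # read only what is needed
        chunksize=chunksize,
    )
    for chunk in reader:
        kept = chunk[chunk["action_taken"].isin(["1", "3"])]
        for state, n in kept["state_code"].value_counts().items():
            totals[state] = totals.get(state, 0) + int(n)
    return totals

# Tiny stand-in for the multi-gigabyte snapshot file.
demo = io.StringIO(
    "state_code|action_taken|loan_amount\n"
    "IL|1|205000\nIL|3|155000\nWI|4|95000\nWI|1|305000\nIL|1|255000\n"
)
totals = count_by_state(demo, chunksize=2)
print(totals)
```

For the real file, `source` would be the path to the extracted snapshot, and restricting `usecols` keeps memory use per chunk small.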

Pre-2018 vs. Post-2018 Data: A Structural Break

The 2015 HMDA Final Rule created a structural break in the dataset that analysts must account for. Pre-2018 HMDA records include loan amount, action taken, applicant demographics, property location, loan type, loan purpose, and rate spread for loans priced above the higher-priced reporting threshold. They do not include credit scores, DTI, CLTV, AUS results, or detailed pricing data.

This means that multi-year analysis spanning the 2017–2018 transition requires either restricting to fields common to both periods or building separate models for pre- and post-2018 data. Denial rate disparity analysis is feasible across the entire series using the common fields. But any analysis that uses credit score or DTI as a control variable — as a lender would in a disparate treatment defense — can only be performed on 2018 and later data.
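Restricting to common fields usually also means harmonizing codes. The sketch below maps loan purpose onto a shared label set, assuming the pre-2018 files code refinancing as 3 (an assumption to verify against the historical file specification) and that both periods expose `loan_purpose` and `action_taken` columns; the sample rows are invented.

```python
import pandas as pd

# Assumed pre-2018 coding; the post-2018 codes follow the list given earlier.
PRE_2018 = {1: "purchase", 2: "home improvement", 3: "refinance"}
POST_2018 = {1: "purchase", 2: "home improvement", 31: "refinance",
             32: "refinance", 4: "other", 5: "not applicable"}
COMMON = ["year", "action_taken", "loan_purpose_common"]

def harmonize(df: pd.DataFrame, year: int) -> pd.DataFrame:
    """Reduce one year's records to fields comparable across the 2018 break."""
    mapping = PRE_2018 if year < 2018 else POST_2018
    out = df.copy()
    out["year"] = year
    out["loan_purpose_common"] = out["loan_purpose"].map(mapping)
    return out[COMMON]   # post-2018-only fields such as DTI are dropped here

old = pd.DataFrame({"loan_purpose": [1, 3], "action_taken": [1, 3]})
new = pd.DataFrame({
    "loan_purpose": [31, 32, 1],
    "action_taken": [1, 3, 1],
    "debt_to_income_ratio": ["36", "50%-60%", "<20%"],
})
combined = pd.concat([harmonize(old, 2017), harmonize(new, 2018)], ignore_index=True)
print(combined)
```

Collapsing codes 31 and 32 into one refinance category gives up the cash-out distinction, which is the price of a consistent series.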

For long-run historical analysis — documenting that denial rate disparities persisted through the 2010s, declined during the refinancing boom of 2020–2021, and rose again as rates increased in 2022 — the pre-2018 series is essential. The CFPB maintains historical HMDA files going back to 1990 in a consistent format, though the very early files (1990 to 2003) have slightly different field structures and geographic coding.

HMDA and the Community Reinvestment Act

Federal banking regulators — the OCC, the Federal Reserve, and the FDIC — use HMDA data as the primary quantitative input to CRA examinations. The CRA requires covered institutions to meet the credit needs of their entire communities, including low- and moderate-income (LMI) census tracts. The lending test component of a CRA examination assesses whether the institution's mortgage originations are geographically distributed in proportion to the population and credit-eligible borrowers in its assessment area, with particular attention to LMI tracts.
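A rough version of the geographic distribution measure is the share of originations located in LMI tracts. The sketch assumes a column named `tract_to_msa_income_percentage` (tract median income as a percentage of the area median) and applies the conventional definition of LMI as below 80 percent; the column name and sample values are assumptions for illustration.

```python
import pandas as pd

def lmi_share(originations: pd.DataFrame) -> float:
    """Share of originations in tracts below 80% of area median income."""
    pct = pd.to_numeric(
        originations["tract_to_msa_income_percentage"], errors="coerce"
    )
    valid = pct.notna()          # ignore records with no usable tract income
    return float((pct[valid] < 80).mean())

# Invented originations; "NA" stands in for a missing tract income value.
loans = pd.DataFrame(
    {"tract_to_msa_income_percentage": ["45", "79.9", "80", "120", "NA"]}
)
share = lmi_share(loans)
print(share)
```

An examiner would compare this share against a benchmark for the assessment area, such as the share of owner-occupied units in LMI tracts, rather than read it in isolation.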

CRA ratings — Outstanding, Satisfactory, Needs to Improve, or Substantial Noncompliance — affect a bank's ability to obtain regulatory approval for mergers and acquisitions. A Needs to Improve rating can block or substantially delay a proposed merger. Community organizations routinely file CRA challenges using HMDA data to demonstrate that a bank seeking merger approval has inadequate lending in minority or LMI neighborhoods, and regulators are required to take those challenges into account.

The 2023 CRA final rule — a joint rulemaking by the OCC, Fed, and FDIC — substantially revised the CRA evaluation framework and the metrics used to assess lending test performance. HMDA remains the foundation of those metrics, though the weighting and comparison benchmarks have changed. Analysts working on CRA compliance or CRA challenges need to understand both the HMDA data structure and the specific performance benchmarks published by the regulators.

Python: Racial Denial Rate Disparity by County

The following script downloads the HMDA modified LAR for Illinois via the CFPB Data Browser API, filters to conventional home purchase applications, and computes the denial rate for Black and White applicants in each county. It then calculates the disparity ratio (Black denial rate divided by White denial rate) and identifies the ten counties with the largest racial gap — the standard first-cut analysis in a fair-lending examination.

import requests
import pandas as pd
import io

# Download the HMDA modified LAR snapshot for a given year and state
# The CFPB Data Browser API returns a CSV of all loan-level records
# filtered to your parameters.

YEAR = "2023"
STATE_CODE = "IL"  # Illinois; the Data Browser API takes two-letter state abbreviations

url = (
    "https://ffiec.cfpb.gov/v2/data-browser-api/view/csv"
    "?years=" + YEAR
    + "&states=" + STATE_CODE
    + "&actions_taken=1,3"          # originated (1) and denied (3)
    + "&loan_purposes=1"            # home purchase only
    + "&loan_types=1"               # conventional only
)

resp = requests.get(url, timeout=300)
resp.raise_for_status()

df = pd.read_csv(io.StringIO(resp.text), dtype=str, low_memory=False)

# Numeric conversion for key columns
for col in ["loan_amount", "income", "action_taken", "county_code"]:
    if col in df.columns:
        df[col] = pd.to_numeric(df[col], errors="coerce")

# Keep rows with a valid county code and a usable derived_race value
df = df.dropna(subset=["county_code", "derived_race"])
df["county_code"] = df["county_code"].astype(int)

# Focus on two groups for the disparity analysis
races_of_interest = ["White", "Black or African American"]
df_focus = df[df["derived_race"].isin(races_of_interest)].copy()

# Action taken: 1 = originated, 3 = denied
df_focus["originated"] = (df_focus["action_taken"] == 1).astype(int)
df_focus["denied"]     = (df_focus["action_taken"] == 3).astype(int)

# Aggregate by county and race: count applications, originations, denials
agg = (
    df_focus.groupby(["county_code", "derived_race"])
    .agg(
        applications=("action_taken", "count"),
        originations=("originated", "sum"),
        denials=("denied", "sum"),
    )
    .reset_index()
)

# Compute denial rate per group
agg["denial_rate"] = agg["denials"] / agg["applications"]

# Pivot so White and Black denial rates are side by side
pivot = agg.pivot_table(
    index="county_code",
    columns="derived_race",
    values=["applications", "denial_rate"],
)
pivot.columns = ["_".join(c).strip() for c in pivot.columns]
pivot = pivot.reset_index()

# Rename for readability
pivot = pivot.rename(columns={
    "denial_rate_Black or African American": "denial_rate_black",
    "denial_rate_White": "denial_rate_white",
    "applications_Black or African American": "apps_black",
    "applications_White": "apps_white",
})

# Drop counties where either group has fewer than 30 applications (unreliable rates)
pivot = pivot[
    (pivot["apps_black"] >= 30) & (pivot["apps_white"] >= 30)
].copy()

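# Compute the disparity ratio: Black denial rate divided by White denial rate.
# Skip counties with a zero White denial rate, where the ratio is undefined.
pivot = pivot[pivot["denial_rate_white"] > 0].copy()
pivot["disparity_ratio"] = pivot["denial_rate_black"] / pivot["denial_rate_white"]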

# Top 10 counties by largest racial denial-rate gap (disparity ratio)
top10 = pivot.nlargest(10, "disparity_ratio")[
    ["county_code", "apps_black", "apps_white",
     "denial_rate_black", "denial_rate_white", "disparity_ratio"]
].reset_index(drop=True)

# Format rates as percentage strings and the ratio as a multiplier for display
top10["denial_rate_black_pct"] = top10["denial_rate_black"].apply(
    lambda x: str(round(x * 100, 1)) + "%"
)
top10["denial_rate_white_pct"] = top10["denial_rate_white"].apply(
    lambda x: str(round(x * 100, 1)) + "%"
)
top10["disparity_ratio_fmt"] = top10["disparity_ratio"].apply(
    lambda x: str(round(x, 2)) + "x"
)

print(
    top10[
        ["county_code", "denial_rate_black_pct",
         "denial_rate_white_pct", "disparity_ratio_fmt"]
    ].to_string(index=False)
)

The API filters on actions_taken=1,3 to return only originated and denied applications, excluding withdrawn files and incomplete applications that could dilute the denial rate calculation. The derived_race field is the CFPB's consolidated race variable, which resolves multi-race and co-applicant complications into a single classification per record. The 30-application threshold per county per group eliminates cells where the denial rate would be statistically unreliable.

A disparity ratio above 2.0, meaning Black applicants are denied at more than twice the rate of White applicants, is a common trigger for deeper investigation in CFPB fair-lending examinations. The ratio alone does not establish discrimination; it is the starting point for controlling on credit score, DTI, income, loan amount, and property characteristics, which the post-2018 fields make possible within HMDA itself rather than requiring a separate data pull.
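One simple way to begin controlling for underwriting factors is to compare denial rates within bands of a control variable. The sketch below stratifies by a hypothetical `dti_band` column on invented data; it is a crude stand-in for the regression models used in actual examinations.

```python
import pandas as pd

black, white = "Black or African American", "White"

# Invented applications: three per group in each DTI band.
apps = pd.DataFrame({
    "dti_band": ["<36"] * 6 + ["36-49"] * 6,
    "derived_race": [black, black, black, white, white, white] * 2,
    "denied": [1, 0, 0, 0, 0, 0,
               1, 1, 0, 1, 0, 0],
})

# Denial rate for each group within each DTI band.
rates = (
    apps.groupby(["dti_band", "derived_race"])["denied"]
    .mean()
    .unstack("derived_race")
)
# Gap in denial rates among applicants in the same band.
rates["gap"] = rates[black] - rates[white]
print(rates.round(3))
```

If the gap persists inside each band, the difference in DTI distributions between groups does not by itself explain the raw disparity.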

Connecting HMDA to Other Federal Datasets

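HMDA's geographic spine (census tract and county FIPS codes) makes it straightforward to join against other federal datasets. The most common pairings, noted at the end of this article, are the CFPB enforcement actions database, the FHFA House Price Index, and the HUD LIHTC database.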

Limitations and Analytical Cautions

HMDA records what lenders report, not independently verified facts. Income is applicant-stated at application, before verification. Credit scores are reported by the lender using whatever credit score model the institution uses, and the model varies across lenders, making cross-lender credit score comparisons imprecise. DTI and CLTV are calculated at application using the lender's methodology, which may differ across institutions.

The dataset records applications, not applicants. A borrower who applies to three lenders and is denied twice before being originated by the third will appear three times in HMDA (twice as denied, once as originated), which inflates application and denial counts relative to the number of distinct borrowers. Researchers studying denial rates should be aware that the denominator includes repeat applications from the same borrower.

HMDA does not capture loans below a certain amount threshold, loans made by exempt institutions, or certain manufactured home loans that fall outside the coverage definition. The 80–85% coverage figure is a market-wide estimate; coverage rates in specific MSAs or for specific loan products can be higher or lower. Small markets where community banks dominate may have meaningfully lower HMDA coverage than large metropolitan areas where large national lenders are active.

Finally, the eight-month publication lag — 2023 data published in August 2024 — means HMDA is a lagging indicator of credit conditions. For monitoring mortgage market access in real time, HMDA is supplemented by the MBA Weekly Mortgage Application Survey and the FHFA Monthly Interest Rate Survey, neither of which has HMDA's demographic granularity.


The CFPB's enforcement actions database documents the redlining settlements, fair-lending orders, and civil money penalties that flow from HMDA investigations. See CFPB Enforcement Actions: The Federal Consumer Finance Penalty Database.

Home price appreciation at the county and MSA level provides essential context for understanding how HMDA-documented lending gaps compound into racial wealth disparities. See FHFA House Price Index: The Authoritative County-Level Home Price Dataset.

The HUD Low Income Housing Tax Credit database documents affordable housing development by census tract — a natural complement to HMDA for studying credit access in LMI neighborhoods. See HUD LIHTC Database: Mapping Every Affordable Housing Tax Credit Project.