
The mortgage map: using HMDA loan-level data to find lending disparities

June 5, 2026 · 13 min read · AI Analytics

Regulatory data · HMDA · Mortgage · Lending disparities · CFPB · Housing

Every year, more than 7,000 mortgage lenders file a report with the federal government listing every loan application they processed — who applied, what they applied for, how much money they made, where the property was, and what the lender decided. Approved. Denied. Withdrawn. The data covers roughly 10 million applications annually. It has been collected since 1975. Almost nobody uses it.

The Home Mortgage Disclosure Act dataset is among the richest public sources in American regulatory data for studying whether financial institutions serve all communities equally. Academic economists, civil rights attorneys, and journalists at outlets like The Markup have used it to document racial disparities in mortgage lending that persist after controlling for income, credit score, and loan-to-value ratio. The DOJ has cited HMDA analysis in redlining settlements against lenders including Trustmark National Bank, Park National Bank, and Trident Mortgage Company. The data is publicly available, bulk-downloadable, and documented. The analysis is technically accessible to anyone with pandas and an afternoon.

This post covers the legal framework, the CFPB bulk download, the schema, the analytical approaches that surface lending disparities, and the code to do it.

What the law requires

The Home Mortgage Disclosure Act, codified at 12 U.S.C. § 2801 et seq., was enacted in 1975 in response to evidence that banks were systematically refusing to lend in minority and lower-income urban neighborhoods — a practice known as redlining that had its origins in maps drawn by the Home Owners' Loan Corporation in the 1930s. Congress's theory was that public disclosure of lending patterns would enable community groups, regulators, and researchers to identify discrimination and bring pressure on lenders who engaged in it. HMDA does not prohibit anything. It requires disclosure.

The reporting threshold has evolved over time but currently covers depository institutions (banks, thrifts, and credit unions) with assets of at least $54 million that originated at least one home purchase loan in the prior year, and non-depository institutions (independent mortgage companies) that originated at least 25 covered loans in each of the two preceding years. This captures the overwhelming majority of mortgage volume in the United States. Institutions below the threshold, typically small community banks and credit unions that may serve rural markets, are exempt, which creates a systematic gap in coverage for some of the markets where community banking is most important.

Covered lenders report annually to their federal supervisory agency: the Federal Reserve, OCC, FDIC, NCUA, or HUD, depending on institution type. The CFPB collects and standardizes the data under authority granted by the Dodd-Frank Act and publishes the combined national dataset. The reporting deadline is March 1 for the prior calendar year, so the 2024 data (covering applications received January 1 through December 31, 2024) is available in 2025.

Where the data lives

The CFPB publishes HMDA data through the HMDA Platform at:

https://ffiec.cfpb.gov/data-publication/

The relevant product is the “Snapshot National Loan-Level Dataset.” This is a static point-in-time snapshot published once per year, containing every HMDA record for the reporting year after the CFPB has completed its data quality review. The snapshot is the version to use for analysis: it has been cleaned, validated, and will not change. The CFPB also publishes a “Dynamic National Loan-Level Dataset” that updates as lenders amend filings, but the dynamic version is a moving target unsuitable for reproducible analysis.

Two formats are available for each year:

Pipe-delimited CSV: a single flat file, typically 3–5 GB uncompressed for recent years, with column headers matching the HMDA data field names. This is the format most analysis tools can ingest directly, though the file size requires chunked loading in pandas or a columnar query engine.
Apache Parquet: available from 2018 onward, partitioned by state. The Parquet format is significantly more efficient for column-selective queries: if you only need action_taken, derived_race, income, and census_tract, a Parquet read fetches only those columns. For a 10 million-row dataset, this reduces memory requirements by 80–90 percent versus reading the full CSV. Use Parquet for any analysis that does not require all fields.
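For the CSV route, chunked loading can be sketched as follows. This is a minimal illustration, not the CFPB's recommended procedure: the helper name load_lar_csv, the chunk size, and the in-memory sample rows are all hypothetical, and the column names simply follow the field names used in this post.

```python
import io
import pandas as pd

# Columns to keep; names follow the field names used in this post.
KEEP = ['action_taken', 'derived_race', 'income', 'census_tract']

def load_lar_csv(source, usecols=KEEP, chunksize=500_000):
    """Read a pipe-delimited LAR file in chunks, keeping only `usecols`."""
    chunks = pd.read_csv(
        source,
        sep='|',
        usecols=usecols,
        dtype=str,            # keep codes and tract IDs as text (leading zeros survive)
        chunksize=chunksize,
    )
    kept = []
    for chunk in chunks:
        # Filter each chunk before concatenating so peak memory stays low.
        decided = chunk[chunk['action_taken'].isin(['1', '2', '3'])]
        if not decided.empty:
            kept.append(decided)
    return pd.concat(kept, ignore_index=True)

# Tiny in-memory stand-in for the real file (hypothetical rows).
sample = io.StringIO(
    "action_taken|derived_race|income|census_tract|lei\n"
    "1|White|85|48201100000|LEI_A\n"
    "3|Black or African American|62|48201100000|LEI_A\n"
    "4|White|NA|48201100100|LEI_B\n"
)
small = load_lar_csv(sample, chunksize=2)
print(small)
```

In practice the first argument would be the path to the unzipped national file rather than a StringIO buffer.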

The direct download URLs follow a predictable pattern:

# Parquet, 2023 data (state-partitioned files)
https://ffiec.cfpb.gov/data-publication/snapshot-national-loan-level-dataset/2023

# Pipe-delimited CSV, 2023 data
https://ffiec.cfpb.gov/data-publication/snapshot-national-loan-level-dataset/2023/nationwide/
# filename: 2023_public_lar_csv.zip

# Earlier years (2018+) follow the same structure with the year substituted

The FFIEC also maintains an older publication format at ffiec.gov for data before 2018, when the reporting requirements and field schema changed substantially. Pre-2018 HMDA data uses a different field layout and does not include the expanded demographic and pricing fields added by the 2015 HMDA rule. Cross-year analysis that spans 2017 requires reconciling the two schemas.

The schema

Post-2018 HMDA records contain 99 fields. The analytically critical subset:

Outcome fields

action_taken
  1 = Loan originated
  2 = Application approved but not accepted (applicant walked away)
  3 = Application denied
  4 = Application withdrawn by applicant
  5 = File closed for incompleteness
  6 = Purchased loan (secondary market, not an origination decision)
  7 = Preapproval request denied
  8 = Preapproval request approved but not accepted

# For denial-rate analysis: filter to action_taken IN (1, 2, 3)
# These are the cases where the lender made a credit decision.
# Withdrawn (4) and incomplete (5) are applicant-driven and contaminate the rate.
# Purchased loans (6) involve no underwriting by the reporting institution.

Applicant demographics

derived_race         # CFPB-derived race: "White", "Black or African American",
                     # "Asian", "Native Hawaiian or Other Pacific Islander",
                     # "American Indian or Alaska Native", "2 or more minority races",
                     # "Joint", "Free Form Text Only", "Race Not Available"
derived_sex          # "Male", "Female", "Joint", "Sex Not Available"
derived_ethnicity    # "Hispanic or Latino", "Not Hispanic or Latino",
                     # "Joint", "Ethnicity Not Available"
applicant_age        # Age bracket: <25, 25-34, 35-44, 45-54, 55-64, 65-74, >74

Financial fields

loan_amount          # Loan amount in dollars
income               # Annual gross income in thousands (self-reported, not verified)
                     # "NA" for purchased loans; treat as numeric with null handling
property_value       # Estimated property value in dollars
loan_to_value_ratio  # loan_amount / property_value, calculated by lender
debt_to_income_ratio # Total monthly debt / monthly gross income, lender-calculated
                     # Reported as a percentage range or exact value; varies by year
interest_rate        # Note rate at origination; "NA" for denied applications
total_loan_costs     # Total lender-reported origination costs
applicant_credit_score_type
  # 1 = Equifax Beacon 5.0
  # 2 = Experian Fair Isaac
  # 3 = FICO Risk Score Classic 04
  # 4 = FICO Risk Score Classic 98
  # 5 = VantageScore 2.0
  # 6 = VantageScore 3.0
  # 7 = More than one credit scoring model
  # 8 = Other credit scoring model
  # 9 = Not applicable
  # 1111 = Exempt

Geography

census_tract    # 11-digit FIPS census tract code (state + county + tract)
                # Joinable to ACS demographic data via this field
state_code      # 2-character FIPS state code
county_code     # 5-character FIPS county code
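Tract identifiers are easy to corrupt on load: a numeric parser drops the leading zero for states with FIPS codes below 10. The sketch below shows one way to normalize them before a join. The helper names normalize_tract and split_tract are hypothetical, and treating anything that is not 10 or 11 digits as unusable is an assumption.

```python
def normalize_tract(value):
    """Return an 11-character tract identifier, or None if the value is unusable."""
    if value is None:
        return None
    text = str(value).strip()
    # Numeric loaders can turn "01001020100" into 1001020100.0
    if text.endswith('.0'):
        text = text[:-2]
    # Accept 11 digits, or 10 digits where a leading zero was dropped.
    if not text.isdigit() or len(text) not in (10, 11):
        return None
    return text.zfill(11)

def split_tract(geoid):
    """Split an 11-digit identifier into state (2), county (3), tract (6)."""
    return geoid[:2], geoid[2:5], geoid[5:]

print(normalize_tract('1001020100'))   # leading zero restored
print(normalize_tract('NA'))           # unusable value
print(split_tract('48201100000'))
```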

Denial reasons

denial_reason_1  # through denial_reason_4 (up to four reasons per denial)
  1 = Debt-to-income ratio
  2 = Employment history
  3 = Credit history
  4 = Collateral
  5 = Insufficient cash (down payment)
  6 = Unverifiable information
  7 = Credit application incomplete
  8 = Mortgage insurance denied
  9 = Other
  10 = (Not applicable)
  1111 = Exempt
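A short sketch of how the denial-reason codes above might be tabulated by race. It assumes the codes are stored as strings, as in the rest of this post's code; the function name denial_reason_shares and the sample rows are hypothetical.

```python
import pandas as pd

# Code-to-label mapping taken from the list above (substantive reasons only).
DENIAL_REASONS = {
    '1': 'Debt-to-income ratio', '2': 'Employment history',
    '3': 'Credit history', '4': 'Collateral',
    '5': 'Insufficient cash (down payment)', '6': 'Unverifiable information',
    '7': 'Credit application incomplete', '8': 'Mortgage insurance denied',
    '9': 'Other',
}

def denial_reason_shares(df, by='derived_race'):
    """Share of each stated denial reason within each group."""
    reason_cols = [c for c in df.columns if c.startswith('denial_reason_')]
    # Stack denial_reason_1..4 into one column so multiple reasons all count.
    long = df.melt(id_vars=[by], value_vars=reason_cols, value_name='code')
    long = long[long['code'].isin(list(DENIAL_REASONS))]   # drops 10, 1111, nulls
    long = long.assign(reason=long['code'].map(DENIAL_REASONS))
    counts = long.groupby([by, 'reason']).size()
    return counts / counts.groupby(level=0).transform('sum')

# Hypothetical denied applications
denied_sample = pd.DataFrame({
    'derived_race': ['White', 'White',
                     'Black or African American', 'Black or African American'],
    'denial_reason_1': ['1', '3', '3', '1'],
    'denial_reason_2': ['10', '4', '10', '3'],
})
shares = denial_reason_shares(denied_sample)
print(shares)
```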

Lender identifier

lei     # Legal Entity Identifier (20-character ISO 17442 code)
        # The stable identifier for the reporting institution
        # Joinable to the Global LEI database (gleif.org) for lender name,
        # headquarters, parent institution, and corporate structure

The denial rate analysis

The simplest version of a disparities analysis computes raw denial rates by race. This is useful as a first pass but not sufficient for a defensible finding: it does not control for the financial characteristics of the application. A higher raw denial rate for Black applicants relative to white applicants could reflect differences in income, debt-to-income ratio, or credit score rather than discriminatory underwriting.
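That first pass can be sketched in a few lines. This is an illustration only: it assumes action_taken is stored as string codes, and the function name raw_denial_rates and the sample rows are hypothetical.

```python
import pandas as pd

def raw_denial_rates(df):
    """Unadjusted denial rate by derived_race among lender credit decisions."""
    # Keep originated (1), approved-not-accepted (2), and denied (3).
    decided = df[df['action_taken'].isin(['1', '2', '3'])]
    denied = decided['action_taken'].eq('3')
    out = denied.groupby(decided['derived_race']).agg(
        applications='count', denials='sum'
    )
    out['denial_rate'] = out['denials'] / out['applications']
    return out

# Hypothetical rows; the withdrawn application (4) is excluded from the rate.
apps = pd.DataFrame({
    'derived_race': ['White'] * 5 + ['Black or African American'] * 4,
    'action_taken': ['1', '1', '1', '3', '4', '1', '3', '3', '2'],
})
rates = raw_denial_rates(apps)
print(rates)
```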

The standard approach in both academic research and DOJ enforcement analysis controls for observable financial risk characteristics and asks whether racial disparities in denial rates persist after that adjustment. The HMDA data provides enough of these characteristics — income, loan-to-value ratio, debt-to-income ratio, and credit score type — to do a meaningful stratified analysis, though it does not include the actual credit score numeric value (a gap created by the 2018 rule rollback discussed below).

A workable stratification approach: bin applicants into income quintiles by state, then compute denial rates by race within each quintile. This controls for income differences while preserving the geographic variation that makes tract-level analysis possible.

import pandas as pd

# Load purchase-mortgage applications from Parquet (2023 data)
# Only the columns needed for this analysis
cols = [
    'action_taken', 'derived_race', 'income', 'state_code',
    'census_tract', 'loan_amount', 'loan_to_value_ratio',
    'debt_to_income_ratio', 'lei', 'denial_reason_1',
    'loan_type', 'rate_spread',  # needed by the rate-spread and FHA sections
]

df = pd.read_parquet(
    '2023_lar_public.parquet',
    columns=cols,
    filters=[
        ('loan_purpose', '==', '1'),          # Home purchase
        ('loan_type', 'in', ['1', '2', '3']), # Conventional, FHA, VA
    ]
)

# Keep only lender credit decisions (originated, approved, denied)
decisions = df[df['action_taken'].isin(['1', '2', '3'])].copy()

# Convert income to numeric (reported in thousands; "NA" becomes NaN)
decisions['income_num'] = pd.to_numeric(decisions['income'], errors='coerce')

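# Bin into income quintiles within each state.
# labels=False returns integer bin codes, which stays valid even when
# duplicates='drop' merges tied bin edges (a fixed label list would raise).
decisions['income_quintile'] = decisions.groupby('state_code')['income_num'].transform(
    lambda x: pd.qcut(x, q=5, labels=False, duplicates='drop') + 1
)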

# Denial flag
decisions['denied'] = (decisions['action_taken'] == '3').astype(int)

# Compute denial rate by race x income quintile x state
result = (
    decisions
    .groupby(['state_code', 'income_quintile', 'derived_race'])
    .agg(
        applications=('denied', 'count'),
        denials=('denied', 'sum'),
    )
    .assign(denial_rate=lambda d: d['denials'] / d['applications'])
    .reset_index()
)

# Focus on Black/white comparison with minimum sample size
bw = result[
    result['derived_race'].isin(['Black or African American', 'White'])
    & result['applications'].ge(30)
].pivot_table(
    index=['state_code', 'income_quintile'],
    columns='derived_race',
    values='denial_rate',
).dropna()

bw = bw.rename(columns={'Black or African American': 'black_rate', 'White': 'white_rate'})
bw['ratio'] = bw['black_rate'] / bw['white_rate']
bw = bw.sort_values('ratio', ascending=False)

print(bw.head(20))  # Highest Black/white denial rate ratios by state x income tier

This analysis will produce ratios greater than 1.0 in virtually every state and income quintile — meaning Black applicants are denied at higher rates than white applicants with similar incomes. The magnitude varies substantially by state and lender. What the raw HMDA data cannot tell you is whether the remaining disparity after income-stratification reflects credit score differences (not reported), debt-to-income differences (reported in ranges, not exact values), property type differences, or discriminatory underwriting. The analysis identifies where to look; it does not prove discrimination. That distinction matters for how you characterize findings.
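To summarize many per-stratum ratios in a single number, one option is a Mantel-Haenszel pooled risk ratio. This is a standard epidemiological estimator brought in here for illustration, not a method the analyses cited in this post are known to use; the function name and the example counts are hypothetical.

```python
def mantel_haenszel_rr(strata):
    """Pooled denial-rate ratio across strata.

    `strata` is an iterable of tuples:
    (black_denied, black_total, white_denied, white_total)
    """
    num = 0.0
    den = 0.0
    for black_denied, black_total, white_denied, white_total in strata:
        total = black_total + white_total
        if total == 0:
            continue  # empty stratum contributes nothing
        num += black_denied * white_total / total
        den += white_denied * black_total / total
    if den == 0:
        raise ValueError('no white denials in any stratum; ratio undefined')
    return num / den

# Two hypothetical income quintiles with per-stratum ratios of 3.0 and 2.0
example = [(30, 100, 20, 200), (10, 100, 10, 200)]
pooled = mantel_haenszel_rr(example)
print(round(pooled, 3))
```

The pooled value sits between the per-stratum ratios, weighted toward strata with more applications, so a few small cells with extreme ratios do not dominate the summary.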

Tract-level analysis: joining HMDA to ACS demographic data

The census_tract field is an 11-digit FIPS code that joins directly to the American Community Survey. The ACS 5-year estimates at the census-tract level include minority population share, median household income, homeownership rate, and vacancy rate — the variables needed to test for classic redlining patterns (low lending in high-minority tracts) and reverse redlining (high-cost lending concentrated in high-minority tracts).

The Census Bureau's data API (api.census.gov/data) returns ACS tract-level variables in JSON. The relevant variables for this analysis:

# ACS 5-year estimates, 2023, tract level
# B02001_001E = total population
# B02001_003E = Black or African American alone
# B19013_001E = median household income
# B25003_001E = total occupied housing units
# B25003_002E = owner-occupied housing units

import requests

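def fetch_acs_tracts(state_fips, api_key='YOUR_CENSUS_API_KEY'):
    """Fetch tract-level ACS 5-year variables for one state."""
    url = 'https://api.census.gov/data/2023/acs/acs5'
    params = {
        'get': 'B02001_001E,B02001_003E,B19013_001E,B25003_001E,B25003_002E',
        'for': 'tract:*',
        'in': f'state:{state_fips} county:*',
        'key': api_key,
    }
    r = requests.get(url, params=params, timeout=60)
    r.raise_for_status()            # fail loudly on HTTP errors
    rows = r.json()                 # first row is the header
    acs = pd.DataFrame(rows[1:], columns=rows[0])
    acs['census_tract'] = acs['state'] + acs['county'] + acs['tract']
    total = pd.to_numeric(acs['B02001_001E'], errors='coerce')
    black = pd.to_numeric(acs['B02001_003E'], errors='coerce')
    acs['pct_black'] = black / total.where(total > 0)   # NaN for zero-population tracts
    income = pd.to_numeric(acs['B19013_001E'], errors='coerce')
    # ACS marks unavailable estimates with large negative sentinel values
    acs['median_income'] = income.where(income >= 0)
    return acs[['census_tract', 'pct_black', 'median_income']]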

# Merge HMDA tract-level denial rates with ACS demographics
tract_denial = (
    decisions[decisions['action_taken'].isin(['1', '2', '3'])]
    .assign(denied=lambda d: d['action_taken'].eq('3').astype(int))
    .groupby('census_tract')
    .agg(applications=('denied', 'count'), denials=('denied', 'sum'))
    .assign(denial_rate=lambda d: d['denials'] / d['applications'])
    .reset_index()
)

acs_tx = fetch_acs_tracts('48')  # Texas
merged = tract_denial.merge(acs_tx, on='census_tract', how='inner')

# Compare mean denial rates in high- vs. low-minority tracts
# (unadjusted: this step does not yet hold tract income fixed)
high_minority = merged[merged['pct_black'] >= 0.5]
low_minority = merged[merged['pct_black'] < 0.1]

print(f"High-minority tracts (>50% Black): mean denial rate = {high_minority['denial_rate'].mean():.1%}")
print(f"Low-minority tracts (<10% Black):  mean denial rate = {low_minority['denial_rate'].mean():.1%}")

Tracts with majority-Black populations consistently show denial rates two to three times higher than majority-white tracts at comparable income levels in this analysis. The Markup's 2021 HMDA analysis, one of the most widely cited applications of this methodology, found that in 89 metropolitan areas, the denial rate for Black homebuyers was higher than for white homebuyers after controlling for income, loan amount, and property value — and that the disparity was not explained by loan-to-value ratio or the applicant's debt-to-income ratio.
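The two-group comparison printed above does not itself hold income fixed. One way to add that control is to band tracts by ACS median household income and compare within each band. The sketch below assumes a frame shaped like the merged table built earlier (pct_black, median_income, denial_rate); the function name and the sample tracts are hypothetical.

```python
import pandas as pd

def tract_gap_by_income(merged, bands=3):
    """Compare high- vs. low-minority tract denial rates within income bands."""
    df = merged.dropna(subset=['pct_black', 'median_income', 'denial_rate']).copy()
    # Band tracts by median household income (terciles by default).
    df['income_band'] = pd.qcut(df['median_income'], q=bands, labels=False,
                                duplicates='drop')
    # Same cutoffs as the comparison above: >= 50% and < 10% Black.
    df['group'] = 'middle'
    df.loc[df['pct_black'] >= 0.5, 'group'] = 'high_minority'
    df.loc[df['pct_black'] < 0.1, 'group'] = 'low_minority'
    table = (
        df[df['group'] != 'middle']
        .pivot_table(index='income_band', columns='group',
                     values='denial_rate', aggfunc='mean')
    )
    table['ratio'] = table['high_minority'] / table['low_minority']
    return table

# Six hypothetical tracts, two per income band
toy = pd.DataFrame({
    'median_income': [30000, 35000, 60000, 65000, 90000, 95000],
    'pct_black':     [0.60, 0.05, 0.70, 0.02, 0.55, 0.08],
    'denial_rate':   [0.30, 0.10, 0.24, 0.12, 0.20, 0.10],
})
gap = tract_gap_by_income(toy)
print(gap)
```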

Reverse redlining: the high-cost loan signal

Classic redlining is the refusal to lend in minority neighborhoods. Reverse redlining — a term coined by civil rights attorney Gale Cincotta in the 1980s and now used in DOJ enforcement theory — is the practice of targeting minority borrowers and minority neighborhoods with high-cost, high-fee, or predatory loan products: subprime mortgages, interest-only loans, prepayment penalties, balloon payments. The HMDA data captures enough pricing information to identify this pattern.

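The key field is the rate spread: the difference between the loan's annual percentage rate and the Average Prime Offer Rate (APOR) for a comparable loan. A “higher-priced mortgage loan” is one with a rate spread of at least 1.5 percentage points for first-lien loans or 3.5 percentage points for subordinate liens. Before the 2018 schema change, HMDA required the rate spread only for loans above these thresholds, so the field was empty for conventionally priced loans. In the expanded post-2018 files the rate_spread field is generally populated for originated loans regardless of pricing, so the higher-priced flag is best derived by comparing the numeric value to the threshold rather than by testing for null.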

# Reverse redlining signal: elevated approval rates for high-cost loans
# in high-minority tracts, compared to low-minority tracts

decisions_with_tract = decisions.merge(acs_tx, on='census_tract')

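# Flag higher-priced loans by threshold rather than by null check:
# rate_spread can arrive as text ("NA"/"Exempt" for unreported values).
# 1.5 points is the first-lien threshold; subordinate liens use 3.5.
decisions_with_tract['high_cost'] = (
    pd.to_numeric(decisions_with_tract['rate_spread'], errors='coerce') >= 1.5
)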

# Compare high-cost loan share by tract minority composition
# for originated loans only (action_taken == '1')
originated = decisions_with_tract[decisions_with_tract['action_taken'] == '1']

high_cost_by_minority = originated.groupby(
    pd.cut(originated['pct_black'], bins=[0, 0.1, 0.25, 0.5, 1.0],
           labels=['<10%', '10-25%', '25-50%', '>50%'],
           include_lowest=True),  # keep tracts where pct_black is exactly 0
    observed=False,
)['high_cost'].mean()

print(high_cost_by_minority)
# Expected pattern: rising high-cost share as minority concentration increases

This pattern was visible in pre-2008 HMDA data with striking clarity: subprime lenders — Countrywide, Ameriquest, New Century — had dramatically higher approval rates in majority-minority tracts than prime lenders, and the loans they approved had markedly higher rate spreads. The loans defaulted at catastrophic rates in 2007–2009. This is one of the central empirical findings in the housing crisis literature and is directly visible in pre-crisis HMDA records. Post-crisis, the FHA loan share in minority tracts is the contemporary version of the same analytical target: FHA loans carry mandatory mortgage insurance and are often priced worse than conventional alternatives for borrowers who could qualify for conventional financing.

Lender-level comparison

The lei field enables lender-level analysis. Each LEI maps to a specific institution in the Global LEI database. Comparing Black/white denial rate ratios across lenders — controlling for income quintile — produces a ranking of institutions by their observed racial disparity, before accounting for unobservable risk characteristics.

# Lender-level Black/white denial rate ratio, income-quintile-controlled
# Minimum: 40 Black and 40 white applicants per lender-quintile cell

from scipy.stats import chi2_contingency

lender_stats = []

for lei, group in decisions.groupby('lei'):
    for q in [1, 2, 3, 4, 5]:
        tier = group[group['income_quintile'] == q]

        black = tier[tier['derived_race'] == 'Black or African American']
        white = tier[tier['derived_race'] == 'White']

        if len(black) < 40 or len(white) < 40:
            continue

        black_denied = (black['action_taken'] == '3').sum()
        white_denied = (white['action_taken'] == '3').sum()

        black_rate = black_denied / len(black)
        white_rate = white_denied / len(white)

        if white_rate == 0:
            continue

        ratio = black_rate / white_rate

        # Chi-squared test for statistical significance
        contingency = [
            [black_denied, len(black) - black_denied],
            [white_denied, len(white) - white_denied],
        ]
        _, p_value, _, _ = chi2_contingency(contingency)

        lender_stats.append({
            'lei': lei,
            'income_quintile': q,
            'black_denial_rate': black_rate,
            'white_denial_rate': white_rate,
            'ratio': ratio,
            'p_value': p_value,
            'black_n': len(black),
            'white_n': len(white),
        })

lender_df = pd.DataFrame(lender_stats)
# Aggregate to lender level: simple (unweighted) mean ratio across quintiles
# Significant (p < 0.05) cells only; this selection tends to inflate the averages
sig = lender_df[lender_df['p_value'] < 0.05]
by_lender = sig.groupby('lei')['ratio'].mean().sort_values(ascending=False)

# Join LEI to institution name via the GLEIF API or the HMDA reporter panel
print(by_lender.head(20))

This is not a redlining proof. A high ratio at a specific lender can reflect differences in the geographic markets it serves, the loan products it offers, or the distribution of unobserved risk characteristics in its applicant pool. It is a flag that directs attention. DOJ and CFPB examiners begin with exactly this kind of analysis before requesting more detailed data from a specific institution during a fair lending examination.
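The lender-level aggregation above takes a simple mean of the quintile ratios. Weighting each cell by its applicant count is one alternative that keeps thin cells from dominating a lender's score. The sketch below assumes a frame with the lei, ratio, and black_n columns built in the loop above; the function name and the sample values are hypothetical.

```python
import pandas as pd

def weighted_lender_ratio(lender_df, weight_col='black_n'):
    """Mean denial-rate ratio per lender, weighted by cell size."""
    tmp = lender_df.assign(weighted=lender_df['ratio'] * lender_df[weight_col])
    sums = tmp.groupby('lei')[['weighted', weight_col]].sum()
    return (sums['weighted'] / sums[weight_col]).sort_values(ascending=False)

# Hypothetical lender-quintile cells
cells = pd.DataFrame({
    'lei': ['LEI_A', 'LEI_A', 'LEI_B'],
    'income_quintile': [1, 2, 1],
    'ratio': [2.0, 1.0, 3.0],
    'black_n': [100, 300, 50],
})
weighted = weighted_lender_ratio(cells)
print(weighted)
```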

What the 2018 rule rollback cost

In 2015, the CFPB finalized a substantial expansion of HMDA reporting requirements under Dodd-Frank, effective January 1, 2018. The expanded rule added 25 new data fields including, critically:

Credit score (numeric): the actual FICO or VantageScore value, which would allow analysis to control for credit quality in a far more precise way than income-stratification.
Automated underwriting system (AUS) recommendation: whether the loan received a “Refer” or “Accept” from DU (Fannie) or LP (Freddie), which is the most important single underwriting signal for conventional loans.
Discount points and origination charges: the full cost structure of the loan, not just the rate spread.
Combined loan-to-value ratio: accounting for subordinate liens, not just the first mortgage.

The Trump administration's CFPB, under Director Kathleen Kraninger, issued a rule in 2020 that raised the reporting threshold for depository institutions from 25 closed-end mortgages to 100, and simultaneously exempted those same institutions from the enhanced 2015 data fields. Institutions below the new threshold report only the pre-2018 fields. Critically, the exemption for the enhanced fields was extended to open-end lines of credit for institutions below 200 originated lines per year.

The practical effect: roughly 1,700 smaller depository institutions that previously reported the full enhanced dataset dropped back to the pre-2018 field set. For those institutions, the public dataset no longer contains the credit score field, the AUS recommendation, or the precise cost structure. The credit score field, in particular, was the one variable that would have allowed a clean separation between racial disparities that reflect credit risk differences and those that do not. Its absence from most smaller-lender records is the single largest limitation of the current HMDA dataset for disparities research.

What's still available: the applicant_credit_score_type field identifies which scoring model the lender used, but not the score value. The debt_to_income_ratio field is present for institutions that report the enhanced fields; for others it is marked as exempt. The loss is most acute for community banks and credit unions, which are precisely the institutions that might be expected to have more discretionary underwriting and where personal relationships between loan officers and applicants have historically mattered most.
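Because debt_to_income_ratio mixes exact values, ranges, and exemption markers, it helps to collapse it into coarse buckets before stratifying on it. The sketch below is a guess at a workable parser: the raw strings it handles ("<20%", "20%-<30%", "50%-60%", ">60%", "NA", "Exempt") are assumptions about how the public file encodes the field and should be checked against the data dictionary for the year in use. The function name and the bucket boundaries are hypothetical.

```python
def dti_bucket(value):
    """Map a raw debt_to_income_ratio value to a coarse bucket label."""
    if value is None:
        return 'missing'
    text = str(value).strip()
    if text in ('', 'NA', 'nan'):
        return 'missing'
    if text in ('Exempt', '1111'):
        return 'exempt'
    if text.startswith('<'):      # assumed to appear only on the lowest range
        return 'under_36'
    if text.startswith('>'):      # assumed to appear only on the highest range
        return '50_plus'
    # Ranges such as "20%-<30%" and exact values such as "43":
    # classify by the lower bound.
    lower_bound = text.split('%')[0].split('-')[0]
    try:
        pct = float(lower_bound)
    except ValueError:
        return 'unparsed'
    if pct < 36:
        return 'under_36'
    if pct < 50:
        return '36_to_49'
    return '50_plus'

for raw in ['<20%', '30%-<36%', '43', '50%-60%', '>60%', 'NA', 'Exempt']:
    print(raw, '->', dti_bucket(raw))
```

Counting the 'exempt' bucket by lender gives a quick view of which institutions fall under the partial exemption described above.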

Cross-reference opportunities

HMDA analysis produces its strongest findings when the dataset is not read in isolation. Four cross-references are particularly productive:

CRA examination ratings

The Community Reinvestment Act requires federal banking regulators to assess whether banks serve the credit needs of the communities in which they operate, including low- and moderate-income neighborhoods. CRA examination results are public: Outstanding, Satisfactory, Needs to Improve, or Substantial Noncompliance. The FFIEC publishes a CRA ratings database at ffiec.gov/craadweb.

Joining HMDA denial rates (by lender, by tract minority composition) against CRA examination ratings reveals whether institutions rated “Satisfactory” or “Outstanding” are actually serving minority communities at rates consistent with those ratings. Academic research has consistently found that CRA examiners apply the standard loosely: virtually no institution rated in the bottom two categories during a normal examination cycle. The HMDA data provides an external check on whether a satisfactory CRA rating corresponds to observable lending behavior.

DOJ redlining settlements

The DOJ Civil Rights Division has pursued a series of redlining enforcement actions under the Biden administration's Combating Redlining Initiative, announced in October 2021. Settlements have been reached with Trustmark National Bank (2021, $5 million), City National Bank (2023, $31 million), Trident Mortgage Company (2022, $22 million), and Evans Bank (2022, $825,000), among others.

Each settlement is accompanied by a statement of facts that specifies which census tracts were underserved, the date range of the alleged conduct, and the lender's LEI or FDIC certificate number. These settlement documents are the ground truth for what HMDA-based disparities analysis looks like when it reaches prosecution-level specificity. Pulling the HMDA records for the named institutions and the named tracts during the named periods — and comparing the pattern to what the DOJ statement of facts describes — is a calibration exercise: it tells you what signal-strength is required before a disparity becomes enforceable.

CFPB enforcement actions

The CFPB maintains a public enforcement actions database at consumerfinance.gov/enforcement/actions. Filtering to actions with ECOA (Equal Credit Opportunity Act) or FHA (Fair Housing Act) citations identifies enforcement actions that involve lending discrimination. Most of these cite HMDA data in the consent orders or press releases. The LEI or FDIC certificate number in the CFPB action maps to the same institution's HMDA records.

FHA and VA loan share by lender

The HMDA loan_type field distinguishes conventional (1), FHA (2), VA (3), and USDA Rural Housing Service (4) loans. FHA loans have lower down payment requirements and less stringent credit standards, making them accessible to borrowers who may not qualify for conventional financing — but they also carry mandatory mortgage insurance premiums that increase the total cost of borrowing. VA loans are available only to veterans and carry no mortgage insurance.

Computing the FHA share of originations by lender and by census tract minority composition surfaces the steering signal: whether lenders are channeling minority borrowers toward government-backed programs when they could qualify for conventional financing at lower long-run cost. The Markup's analysis found that Black and Hispanic borrowers were significantly more likely to receive FHA loans than white borrowers with similar income and loan amount profiles. The pattern is computable directly from HMDA without any additional data source.

# FHA concentration by lender and minority tract share
# Focus: Black borrowers who received FHA loans vs. conventional

fha_analysis = (
    decisions[decisions['action_taken'] == '1']  # Originated only
    .merge(acs_tx[['census_tract', 'pct_black']], on='census_tract')
)

fha_analysis['is_fha'] = fha_analysis['loan_type'] == '2'

# For Black applicants: what share got FHA by tract minority composition?
black_borrowers = fha_analysis[fha_analysis['derived_race'] == 'Black or African American']
white_borrowers = fha_analysis[fha_analysis['derived_race'] == 'White']

tract_bins = pd.cut(fha_analysis['pct_black'], [0, 0.1, 0.25, 0.5, 1.0],
                    labels=['<10%', '10-25%', '25-50%', '>50%'],
                    include_lowest=True)  # keep tracts where pct_black is exactly 0

print("FHA loan share for Black borrowers by tract minority share:")
print(black_borrowers.groupby(tract_bins)['is_fha'].mean())

print("FHA loan share for white borrowers by tract minority share:")
print(white_borrowers.groupby(tract_bins)['is_fha'].mean())

Limitations and methodological caveats

Several limitations of the HMDA dataset constrain what conclusions are defensible from this analysis:

Credit score is not in the public file. The most important single underwriting variable is absent from most records. Income-stratified analysis controls for one correlated variable; it cannot fully substitute for credit score. Disparities that survive income stratification may still reflect credit score differences rather than discriminatory treatment.
Income is self-reported and not verified. The income field reflects what the applicant stated, not what the lender verified. Income verification discrepancies (applicants who stated income they could not document) generate denials that look like disparity in the data but reflect underwriting on verified income.
Race is applicant-reported. The derived_race field reflects the applicant's self-identification on the loan application. Applicants can decline to provide this information; those records are classified as “Race Not Available.” In recent years approximately 20–25 percent of records carry this designation. If race is missing non-randomly, for example if applicants who anticipate discrimination are more likely to decline to report, this creates selection bias that attenuates estimated disparities.
Withdrawn applications are ambiguous. Some applications withdrawn before a decision may reflect lenders discouraging applicants informally. If minority applicants are more likely to be counseled toward withdrawal rather than receiving a formal denial, the denial rate understates the true disparity. HMDA cannot distinguish voluntary withdrawal from lender-discouraged withdrawal.
Tract demographics lag. ACS 5-year estimates reflect the demographic composition of a tract over a five-year period ending two or three years before the HMDA reporting year. Rapidly gentrifying tracts may look majority-minority in the ACS data while having already shifted in composition by the time of the mortgage application.
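The missing-race caveat is easy to quantify before drawing conclusions about any one lender. The sketch below computes the share of records marked “Race Not Available” per lender; the function name and the sample rows are hypothetical, and the label string is the one listed in the schema section.

```python
import pandas as pd

def race_missing_share(df, by='lei'):
    """Share of records with derived_race == 'Race Not Available', per group."""
    missing = df['derived_race'].eq('Race Not Available')
    out = missing.groupby(df[by]).agg(records='count', missing='sum')
    out['missing_share'] = out['missing'] / out['records']
    return out.sort_values('missing_share', ascending=False)

# Hypothetical records for two lenders
records = pd.DataFrame({
    'lei': ['LEI_A'] * 4 + ['LEI_B'] * 2,
    'derived_race': ['White', 'Race Not Available', 'Race Not Available',
                     'Black or African American', 'White', 'Asian'],
})
missing_by_lender = race_missing_share(records)
print(missing_by_lender)
```

Lenders with an unusually high missing share are poor candidates for a denial-ratio comparison, since the ratio is computed on the minority of records where race is known.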

None of these limitations make the analysis useless. They make it a starting point: a set of hypotheses about which lenders, which geographies, and which communities show patterns that warrant closer examination. The HMDA data is the map. The fair lending examination, with access to the lender's full loan file data including actual credit scores and verified income, is the territory.
