Technical writing

Following the Money: Joining Federal Campaign Finance, Lobbying, and Spending Data

· 13 min read· AI Analytics
Entity ResolutionCampaign FinanceLobbyingFederal SpendingData Engineering

A defense contractor gives political money through a PAC, pays a lobbying firm to work an appropriations bill, wins a multibillion-dollar contract from the agency that bill funds, and—years later—settles a False Claims Act case over how it billed that very work. Each of those facts lives in a different federal database, and none of the four databases knows the other three exist. They share no common key, no identifier, no crosswalk. The single most valuable move in federal data analysis is also the hardest—stitching one company's full federal footprint back together across systems that were never built to be joined.

This article covers what it means to follow one entity across the federal money trail and why it is worth the effort; the four anchor datasets—FEC campaign finance, Senate and House lobbying disclosures, USAspending awards, and DOJ False Claims Act recoveries—and what each one records; the central obstacle, which is that none of these systems share a common identifier; the discipline of entity resolution—name normalization, parent-and-subsidiary mapping, and the unavoidable trade-off between false positives and missed links that fuzzy matching forces; the set of questions that only the join can answer, which no single dataset can; how our four tables make the crosswalk tractable by handling the parsing so the work becomes the join; a Python workflow that pulls one company's records from each public source, normalizes the names to a common form, and assembles a single-entity profile; and the caveats—above all match confidence—that every analyst must carry before turning a fuzzy join into a claim about a real company.

What it means to follow the money

Most federal-data work stays inside a single dataset. You count contracts in USAspending, or rank lobbying spend in the disclosures, or trace contributions in the FEC's files, and each of those is a legitimate, self-contained question. But the questions that matter most—the ones journalists, oversight bodies, and researchers actually want answered—almost always cross dataset boundaries. They are questions about a single actor: this company, this individual, and everything the federal government records about its relationship with them. Following the money means assembling, for one entity, the political contributions it makes, the lobbying it pays for, the contracts and grants and loans it wins, and the fraud cases brought against it—so that the entity's full federal footprint becomes visible in a single view.

The payoff is leverage that no individual source provides. A contract award is a number; a contract award next to the lobbying spend that preceded it, the contributions clustered around the committee that oversees the awarding agency, and a later False Claims Act settlement over the same line of business is a story—one that can be tested, quantified, and replicated. This is the analysis that turns four siloed compliance records into a map of influence and accountability. It is also, precisely because the systems were built independently by different branches and agencies under different statutes, a genuine data-engineering problem rather than a query. The rest of this article is about how to do it carefully and what it costs you when you do it carelessly.

The four anchor datasets

Four federal systems anchor the money trail, each administered by a different part of the government under a different authority. The Federal Election Commission (FEC) maintains the campaign-finance record: every political committee—candidate committees, traditional PACs, super PACs, party committees—and the contributions that flow into and out of them. This is the dataset that answers who gives political money, in what amounts, through which committees. The FEC has published these records since the post-Watergate reforms of the 1970s, and they are released through both a modern REST API and downloadable bulk files. In our database the committee registry lives in fec_committees, alongside the contribution records.

The lobbying disclosures are the second anchor. Under the Lobbying Disclosure Act of 1995—substantially strengthened by the Honest Leadership and Open Government Act of 2007—lobbyists and the firms that employ them must register and file quarterly activity reports with the Secretary of the Senate and the Clerk of the House. Each filing names the registrant(the lobbying firm or in-house lobbying operation), the client (the company or interest paying to be represented), the issues lobbied, the agencies and chambers contacted, and a reported spending amount. This is who is paid to influence Washington, on which issues, for whom. Our lobbying_activity table carries the registrant, client, issue, and amount fields parsed from the Senate LDA system.

USAspending.gov is the third anchor and the largest by dollar volume—the government's official record of the contracts, grants, and loans it awards, mandated by the Federal Funding Accountability and Transparency Act of 2006 and the DATA Act of 2014. Every prime award names a recipient, the awarding agency, the amount, and the dates, and—crucially for this exercise—identifies the awardee by a Unique Entity Identifier (UEI), the twelve-character code that replaced the proprietary DUNS number in 2022. Our usaspending_contracts table holds the recipient, agency, award, and UEI fields. The fourth anchor is the Department of Justice's False Claims Act recoveries: the fraud settlements and judgments against government contractors and healthcare firms that knowingly submitted false claims for federal money. The DOJ announces these—billions of dollars a year, a large share originating in whistleblower qui tam suits—in press releases that name the defendant in prose. Our doj_fca table parses those into structured rows.

The central obstacle: no common key

Here is the problem that makes this hard, and it is worth stating bluntly because it is the single fact that governs every design decision downstream: none of these four systems share a common identifier for the entities they describe. There is no federal-government-wide company ID that appears in all four. They were built by different institutions, under different laws, for different purposes, and each chose its own way of naming the parties involved.

USAspending is the best off: it identifies awardees by the UEI (formerly the DUNS number), a real, persistent identifier maintained in the System for Award Management. But that UEI exists nowhere else in the money trail. The FEC has no equivalent identifier at all—a contributor or a committee is stored as a free-text name, typed by a filer, with all the spelling variants, abbreviations, and inconsistencies that implies. The lobbying filings likewise name registrants and clients as free text, and the same client can appear as “Acme Corp,” “Acme Corporation,” and “ACME CORP.” across quarters. And the DOJ False Claims Act announcements name defendants in prose—a paragraph of English, not a database field—so the defendant's name must be extracted before it can be matched to anything at all.

The consequence is that joining these datasets is not a JOIN ... ON id = id. It is entity resolution—the discipline of deciding, without a shared key, which records across systems refer to the same real-world entity. That means normalizing names into a comparable form, mapping subsidiaries and divisions back to their corporate parents, and accepting, as an inescapable property of the method rather than a bug to be fixed, that fuzzy matching produces both false positives (two different companies collapsed into one) and missed links (one company split across spellings that never get reconciled). There is no setting that eliminates both. Every money-trail join is a point chosen on that trade-off curve, and the honest analyst names the point they chose.

To make the shape of the problem concrete, the four tables differ in their grain and their entity column as follows:

-- fec_committees / contributions  (grain: one committee; one contribution)
committee_id        -- FEC committee ID (key within FEC only, not elsewhere)
committee_name      -- FREE TEXT
contributor_name    -- FREE TEXT  <-- the join surface for a company/person
contribution_amount -- dollars
contribution_date   -- date

-- lobbying_activity  (grain: one filing / activity line)
registrant_name     -- FREE TEXT  (the lobbying firm)
client_name         -- FREE TEXT  <-- the join surface for the paying company
issue_area          -- issue code lobbied
amount              -- reported lobbying spend
filing_period       -- quarter / year

-- usaspending_contracts  (grain: one award)
recipient_name      -- FREE TEXT
recipient_uei       -- UEI (real key -- but lives only in USAspending)
awarding_agency     -- agency that made the award
award_amount        -- obligated dollars

-- doj_fca  (grain: one settlement / judgment, parsed from a press release)
defendant_name      -- FREE TEXT, extracted from PROSE  <-- the join surface
settlement_amount   -- recovery amount
case_date           -- announcement date

-- NO COLUMN IS SHARED ACROSS ALL FOUR. The only join surface is the name.

Entity resolution: normalizing names

The first and most consequential step is name normalization: transforming the messy free-text names from each system into a canonical form so that variants of the same company collapse to the same string. The standard moves are uppercasing, stripping punctuation, collapsing whitespace, and—most importantly—removing the corporate suffixes that proliferate without changing the entity: Inc, Incorporated, LLC, Corp, Corporation, Co, Company, LP, LLP, Ltd, Holdings, Group. After this pass, “Acme Corp.,” “ACME CORPORATION,” and “Acme Co, LLC” all reduce to the same key, “ACME.” It sounds trivial. It is not. Suffix stripping that is too aggressive will merge genuinely distinct firms that share a base name; suffix stripping that is too timid will leave the variants unmatched. The normalizer is the weakest link in the entire pipeline, and it deserves more scrutiny than any other component.

Normalization alone is not enough, because the federal money trail is full of parent-and-subsidiary structure. A large contractor wins awards under a dozen operating divisions and acquired subsidiaries, each with its own name and its own UEI; the parent's PAC gives contributions under the parent's name; the lobbying is filed by a Washington office under yet another label. A name-only join treats these as separate entities and therefore radically understates the parent's true footprint. Resolving this requires corporate hierarchy mapping—a lookup that rolls subsidiaries and divisions up to their ultimate parent. There is no single authoritative federal source for this; analysts build it from a combination of SAM parent-organization fields on the USAspending side, public ownership disclosures, and hand-curated lists for the major players. The quality of the parent map is often what separates a credible money-trail analysis from a misleading one, because the largest and most interesting entities are exactly the ones with the most complex structure.

Two further wrinkles deserve mention. Individuals(executives, board members, named defendants) are even harder than companies, because personal names are less distinctive, more often abbreviated, and frequently shared—the same name can belong to many different people, and one person can appear as “Robert,” “Bob,” and “R.” And the DOJ False Claims Act side adds an extraction step before normalization can even begin: the defendant name has to be pulled out of an English sentence, with its own error rate, before it becomes a string you can normalize and match. Every layer compounds the uncertainty, which is why match confidence—not match existence—is the quantity that matters.

Fuzzy matching and its error rates

Even after careful normalization, exact string equality misses real matches: a typo, a transposition, an extra word, a dropped abbreviation. So the practical join layers fuzzy matching on top of normalization— comparing names by similarity rather than identity, using token-based measures (treating a name as a bag of words and scoring overlap), edit-distance measures (counting the character changes needed to turn one string into another), or phonetic encodings. Each candidate pair gets a similarity score, and the analyst sets a threshold above which a pair is treated as the same entity.

That threshold is the trade-off made explicit. Set it high—demand near-exact similarity—and you minimize false positives but accept more missed links: real matches with slightly different spellings slip through, and you understate the footprint. Set it low—accept loose similarity—and you catch more variants but start merging genuinely different companies that happen to share words, inflating the footprint with records that do not belong to the entity. There is no threshold that is right for every purpose. A screening tool that wants to surface every possible link for a human to review should favor recall (a low threshold, more candidates); a published statistic that will be quoted as fact should favor precision (a high threshold, plus manual review of the borderline cases). The disciplined practice is to treat fuzzy matches as candidates, not conclusions: generate them automatically, then have a human adjudicate the ambiguous ones—especially for the high-stakes matches where a single wrong link can invalidate a claim. The output of the pipeline should carry a confidence flag on every cross-dataset link, and the analysis built on top of it should propagate that confidence rather than silently treating every match as certain.

The questions only the join can answer

All of this effort is justified by a class of questions that no single dataset can address, because each requires data from two or more of the systems at once. These are the questions that make the money trail worth tracing.

Does a contractor's lobbying spend track the contracts it wins? With lobbying_activity joined to usaspending_contracts by entity, you can place a firm's quarterly lobbying expenditure next to its award flow over time and ask whether the two move together, whether lobbying leads awards, and on which issues the alignment is tightest. No single source contains both halves. Do a firm's political contributions cluster around the committees that oversee its agency? Joining fec_committees and the contribution records to the awarding-agency field of its contracts lets you test whether the money a contractor gives concentrates on the members and committees with jurisdiction over the agency that pays it—the authorizers and appropriators of its budget—rather than being spread at random.

Do firms with large federal awards also turn up in False Claims Act settlements? Joining usaspending_contracts to doj_fca by entity surfaces the contractors that both win heavily and settle heavily—and lets you ask whether the fraud recoveries are concentrated among the biggest awardees or distributed differently, and whether a settlement is followed by a change in subsequent award flow. And how concentrated is the overlap among the biggest players? Across all four datasets, you can ask what share of total contributions, lobbying, awards, and recoveries is accounted for by the small set of entities that appear prominently in all four—the firms whose federal footprint is comprehensive. That concentration measure is itself a finding, and it is only computable once the entities have been resolved across the systems. Every one of these questions is a join away—and a join is, here, an act of entity resolution with a confidence attached.

How our tables make the crosswalk tractable

The reason this analysis is approachable at all in our database is that the four tables—fec_committees and the contribution records, lobbying_activity, usaspending_contracts, and doj_fca—are each already parsed and queryable. The single most time-consuming part of a money-trail project, for an analyst starting from raw sources, is not the join at all; it is the upstream parsing—decoding the FEC's fixed-width bulk formats, paging the lobbying API and flattening its nested filings, normalizing the USAspending award schema, and extracting structured settlement rows from DOJ prose. When those four parsing jobs are done and the data sits in clean tables, the work that remains is the genuinely interesting part: the join between them.

That is the deliberate division of labor. The tables handle the parsing; the analyst handles the entity resolution and the cross-dataset questions. It is worth being clear about what this does and does not buy you. It does not hand you a resolved entity key—the free-text names are still free text, and the UEI still lives only on the USAspending side, so the matching problem described above is yours to solve and yours to assign confidence to. What it buys you is that you never have to fight the source formats: the contributor name, the client name, the recipient name and UEI, and the defendant name are all sitting in columns, ready to be normalized and compared. The work becomes the join between the tables rather than the parsing of each—which is exactly where an analyst's judgment is most valuable and least replaceable.

The public sources behind the tables

All four anchors are public and either key-free or low-friction to access, which is what makes the workflow reproducible. The FEC exposes a modern REST API at api.open.fec.gov (a free key from api.data.gov covers it) and also publishes complete bulk files—the candidate, committee, and individual-contribution master files—for anyone who needs the full record rather than query slices. The lobbying disclosures are available from the Senate at lda.senate.gov, which offers both a REST API and downloadable filing data, with a parallel record on the House side; the Senate system is the one most analysts pull from. USAspending provides a comprehensive, no-key REST API at api.usaspending.gov covering award search, recipient profiles, and agency spending, plus bulk award downloads. And the DOJpublishes its False Claims Act recoveries through its press-release archive and periodic statistical summaries; because these are prose, the structured doj_fca table is the product of parsing them, and any defendant match back to the other three sources should be treated as the lowest-confidence link in the chain—the one most worth confirming by reading the underlying case.

Python workflow: a single-entity profile

The script below pulls one company's records from each public source, normalizes the name to a common form, and assembles a single-entity profile spanning contributions, lobbying, and awards—with the DOJ False Claims Act side flagged as a manual, low-confidence cross-check rather than an automatic join, because the defendant names live in prose. It hits the real FEC, Senate LDA, and USAspending APIs; only the FEC requires a key, and a free DEMO_KEY-style key from api.data.gov suffices for exploration. The normalize function is intentionally simple so its behavior is legible—it is the part you should harden first for production work.

import requests, re
import pandas as pd

# ----------------------------------------------------------------------
# Single-entity federal money-trail profile.
#
# Pulls one company’s records from FOUR public federal systems, normalizes
# the entity name to a common form, and assembles a single profile spanning
# political contributions, lobbying, awards, and (qualitatively) enforcement.
#
#   1. FEC         -- contributions / committees (api.open.fec.gov, free key)
#   2. Lobbying    -- Senate LDA disclosures     (lda.senate.gov REST API)
#   3. USAspending -- contracts/grants/loans     (api.usaspending.gov, no key)
#   4. DOJ FCA     -- False Claims Act recoveries (named in prose; manual match)
#
# CRITICAL: none of these share a key. USAspending uses the UEI; FEC,
# lobbying, and DOJ store names as free text. We match on a normalized name
# and report a confidence flag, NOT a guarantee.
# ----------------------------------------------------------------------

FEC_KEY = "DEMO_KEY"   # get a free key at api.data.gov

def normalize(name):
    # Crude name canonicalizer: uppercase, strip punctuation and the most
    # common corporate suffixes so "Acme Corp." and "ACME CORPORATION"
    # collapse to one key. This is the weakest link in the whole job.
    n = name.upper()
    n = re.sub(r"[.,&/']", " ", n)
    suffixes = r"\b(INC|INCORPORATED|LLC|L L C|CORP|CORPORATION|CO|COMPANY|LP|LLP|LTD|HOLDINGS|GROUP|THE)\b"
    n = re.sub(suffixes, " ", n)
    return re.sub(r"\s+", " ", n).strip()

def fec_contributions(name):
    # Schedule A receipts where the contributor name matches (free text).
    url = "https://api.open.fec.gov/v1/schedules/schedule_a/"
    p = {"api_key": FEC_KEY, "contributor_name": name,
         "per_page": 100, "sort": "-contribution_receipt_date"}
    d = requests.get(url, params=p, timeout=60).json()
    return pd.DataFrame(d.get("results", []))

def lobbying(name):
    # Senate LDA: filings where the client name matches (free text).
    url = "https://lda.senate.gov/api/v1/filings/"
    d = requests.get(url, params={"client_name": name, "page_size": 100},
                     timeout=60).json()
    return pd.DataFrame(d.get("results", []))

def usaspending(name):
    # USAspending award search by recipient name; carries the UEI key.
    url = "https://api.usaspending.gov/api/v2/search/spending_by_award/"
    body = {"filters": {"recipient_search_text": [name],
                        "award_type_codes": ["A", "B", "C", "D"]},
            "fields": ["Award ID", "Recipient Name", "Award Amount",
                       "Awarding Agency", "recipient_id"],
            "page": 1, "limit": 100}
    d = requests.post(url, json=body, timeout=90).json()
    return pd.DataFrame(d.get("results", []))

def profile(company):
    key = normalize(company)
    fec = fec_contributions(company)
    lob = lobbying(company)
    awd = usaspending(company)

    contrib = fec["contribution_receipt_amount"].sum() if not fec.empty else 0
    award = awd["Award Amount"].sum() if not awd.empty else 0
    print(f"Entity: {company}   (normalized key: {key})")
    print(f"  FEC contributions matched : {len(fec):>5}  total ${contrib:>14,.0f}")
    print(f"  Lobbying filings matched  : {len(lob):>5}")
    print(f"  Federal awards matched    : {len(awd):>5}  total ${award:>14,.0f}")
    # DOJ FCA settlements are announced in prose press releases with no key,
    # so this step is a MANUAL cross-check against the doj_fca table, flagged
    # rather than auto-joined. Never assert a match without reading the case.
    print("  DOJ FCA: cross-check doj_fca by normalized name -- MANUAL, low confidence")
    return {"fec": fec, "lobbying": lob, "awards": awd}

profile("Lockheed Martin")

Two practical notes apply. First, the matching here is deliberately coarse: it relies on each API's own name-search parameter and a single shared normalizer, which means the counts it prints are candidate matches, not confirmed ones—a serious profile would roll subsidiaries up to the parent before searching, layer fuzzy scoring with an explicit threshold on top of the API results, and carry a confidence flag through to every printed figure. Second, for anything beyond a single company—ranking concentration across all entities, or building the full four-way overlap—the bulk files (the FEC master files, the Senate LDA downloads, and the USAspending bulk award extracts) are far more efficient than thousands of per-entity API calls, and they let you build the normalizer and parent map once over the whole universe rather than re-deriving them per query. The single-entity script is the right shape for investigation; the bulk pipeline is the right shape for measurement.

Limitations and analytical caveats

The money-trail join is powerful precisely because it crosses boundaries no single dataset crosses, but that same property loads it with caveats that an analyst must carry before turning a fuzzy match into a claim about a real company.

Match confidence is the whole game. Because the join surface is a free-text name rather than a shared key, every cross-dataset link is a probabilistic judgment, not a fact. A false positive—two distinct firms merged on a shared word—invents a relationship that does not exist; a missed link—one firm split across spellings—hides a relationship that does. Both errors are present in any real run, and the right response is not to pretend they are gone but to quantify them: carry a confidence score on every link, review the high-stakes matches by hand, and report findings with the matching method and threshold stated, so a reader can judge how much weight the join can bear.

Corporate structure obscures the true entity. Parents, subsidiaries, divisions, joint ventures, and shell entities mean that the legal name on a record is frequently not the economic actor you care about. Without a good parent map, a name-only join understates large, complex firms—the most consequential ones—by scattering their footprint across labels, and there is no single authoritative federal source for ownership to lean on. Any concentration or footprint claim is only as good as the hierarchy mapping behind it.

Correlation is not influence, and timing is not causation.That a contractor's lobbying spend tracks its awards, or that its contributions cluster around its overseers, is a real and testable pattern—but it is a pattern, not proof of a quid pro quo. Disclosed lobbying, lawful contributions, and competitively awarded contracts can co-occur for many reasons, including that large firms in regulated markets do all three at scale simply because they are large. The data supports description and hypothesis generation; it does not, on its own, establish intent. And each source carries its own coverage limits—contribution-disclosure thresholds and the super-PAC and dark-money channels that sit outside the itemized FEC record, lobbying reported in ranges and excluding activity below the registration threshold, and DOJ recoveries that capture settled cases rather than all misconduct. Held with these caveats in mind, the four-table money trail is among the most revealing analyses federal data permits: a way to see one entity's political giving, paid influence, public awards, and fraud exposure in a single frame—so long as the analyst never forgets that the frame is held together by names, and names are the most fragile key of all.

Related writing

USASpending.gov: The Federal Spending Database Behind $6 Trillion in Annual Contracts, Grants, and Loans — The awards half of the money trail and the only one of the four anchors with a real entity key, the UEI, whose structure and recipient model are what make the spending side joinable in the first place.

FEC Committees: The Federal Registry of Every PAC, Super PAC, and Campaign Committee — The campaign-finance anchor, where committees and contributors are stored as free-text names with no shared identifier, the source of both the political-giving signal and the hardest matching problem in the join.

Federal Lobbying Disclosures: The Public Record of Who Is Paid to Influence Washington — The paid-influence anchor, naming registrants and clients in free text, whose quarterly spend is the series you place next to a firm's award flow to ask whether lobbying tracks contracts.