Technical writing

Tracking PAC money through FEC data: entity resolution across 50 filing types at 2.4 million donor records

September 16, 2024· 14 min read· AI Analytics

OSINTData engineeringEntity resolutionElections

The Federal Election Commission publishes data from 50+ filing form types covering committees, candidates, individual donors, independent expenditures, party transfers, and more. At the surface level this is a transparency system — every dollar in federal election spending is required to be disclosed. In practice, the data is fragmented across incompatible identifiers, inconsistent name spellings, and organizational structures designed (often deliberately) to obscure beneficial ownership.

We built an entity resolution pipeline that ingests the full FEC bulk data export (approximately 2.4 million individual donor records per two-year election cycle, plus 700K committee records) and produces a canonical entity graph linking donors to committees, committees to candidates, Super PACs to their LLC donors, and LLCs to their beneficial owners via FinCEN and PACER records.

FEC committee type taxonomy

The FEC uses single-letter committee type codes that determine what a committee can legally do and what filings it must submit. Entity resolution must handle these types differently because the same real-world political organization may operate multiple committees of different types.

Code	Type	Resolution challenge
H	House authorized committee	Links directly to candidate via FEC_ID; trivial
S	Senate authorized committee	Links directly to candidate; trivial
P	Presidential authorized committee	Links directly to candidate; trivial
X	Party committee (national)	Multiple committees per party; require party-name normalization
Y	Party committee (state/local)	State prefix extraction; 50-state deduplication
N	PAC (non-qualified)	Name matching to sponsoring organization required
Q	PAC (qualified)	Name matching to sponsoring organization required
O	Super PAC (independent expenditure only)	LLC donors require beneficial owner trace
I	Hybrid PAC	Mixed donor pool; both rules apply
U	Joint fundraising committee	Allocation formulas link to multiple participant committees

Joint fundraising committee resolution

Joint fundraising committees (JFCs, type U) are the most structurally complex entity in FEC data. A JFC is a temporary committee formed by two or more candidate committees and/or party committees to raise money together and allocate receipts among participants according to a registered allocation formula.

The FEC records JFC transfers as separate itemized disbursements from the JFC to each participant — but the connection between a donor's original contribution to the JFC and the ultimate beneficiary committees is buried in the JOINT_FUNDRAISING_REP fields and in separate FEC Form 99 filings. Without resolving these links, a money-flow analysis will attribute tens of millions of dollars to JFCs that appear to be endpoints when they are actually passthrough vehicles.

from dataclasses import dataclass, field
from typing import Optional

@dataclass
class JFCAllocation:
    participant_committee_id: str    # FEC committee ID of the participant
    participant_name: str
    allocation_pct: float            # 0.0-1.0; must sum to 1.0 across participants
    allocation_cap_usd: Optional[float]  # per-donor aggregate limit (None = no cap)

@dataclass
class JointFundraisingCommittee:
    jfc_committee_id: str            # FEC ID of the JFC itself
    jfc_name: str
    registered_agent: str
    participants: list[JFCAllocation] = field(default_factory=list)
    # Computed fields populated after resolving Form 99 filings
    total_raised_usd: float = 0.0
    total_transferred_usd: float = 0.0

def resolve_jfc_participants(
    jfc_id: str,
    form99_df,      # DataFrame of FEC Form 99 filings
    disbursements_df,  # DataFrame of FEC disbursement records
) -> JointFundraisingCommittee:
    """Reconstruct JFC allocation from Form 99 + disbursement records."""
    # Form 99 filings contain the human-readable allocation memo
    # Disbursements from the JFC to participants give the realized allocation
    form99_rows = form99_df[form99_df['committee_id'] == jfc_id]
    allocation_text = ' '.join(form99_rows['memo_text'].dropna().tolist())

    # Parse "Candidate A: 75%, Candidate B: 25%" allocation strings
    participants = _parse_allocation_memo(allocation_text)

    # Cross-validate against actual disbursement records
    actual_disbursements = disbursements_df[
        (disbursements_df['filer_committee_id'] == jfc_id) &
        (disbursements_df['disbursement_type'].isin(['24K', '24Z']))
    ]

    # If memo parsing failed, infer from realized disbursements
    if not participants:
        total = actual_disbursements['disbursement_amount'].sum()
        for _, row in actual_disbursements.iterrows():
            participants.append(JFCAllocation(
                participant_committee_id=row['recipient_committee_id'],
                participant_name=row['recipient_name'],
                allocation_pct=row['disbursement_amount'] / total if total > 0 else 0.0,
                allocation_cap_usd=None,
            ))

    return JointFundraisingCommittee(
        jfc_committee_id=jfc_id,
        jfc_name=form99_rows.iloc[0]['committee_name'] if len(form99_rows) > 0 else jfc_id,
        registered_agent='',
        participants=participants,
    )

Name normalization pipeline

FEC filings are free-text fields entered by campaign staff and data-entry operators. The same donor appears as “Apple, Inc.”, “Apple Inc”, “APPLE INC.”, and “Apple Incorporated” across different filing cycles. The normalization pipeline reduces all of these to a canonical form before matching.

import re
import unicodedata

# Legal entity suffixes to strip from organization names
LEGAL_SUFFIXES = re.compile(
    r'(inc|incorporated|corp|corporation|co|company|llc|llp|lp|ltd|limited'
    r'|pc|pllc|pa|assoc|associates|grp|group|intl|international'
    r'|fund|pac|super pac|political action committee).?$',
    re.IGNORECASE,
)

# Individual name prefixes/suffixes
NAME_PREFIXES = re.compile(r'^(mr|mrs|ms|dr|prof|rev|hon|capt|col|gen).?s+', re.IGNORECASE)
NAME_SUFFIXES = re.compile(r's+(jr|sr|ii|iii|iv|esq|md|phd|cpa).?$', re.IGNORECASE)

def normalize_entity_name(raw: str, entity_type: str = 'ORG') -> str:
    """
    Normalize a FEC filer/recipient name to a canonical form.
    entity_type: 'ORG' or 'IND' (individual).
    """
    if not raw:
        return ''

    # Unicode normalization: decompose, strip diacritics, recompose
    nfkd = unicodedata.normalize('NFKD', raw)
    ascii_str = ''.join(c for c in nfkd if not unicodedata.combining(c))

    # Uppercase, collapse whitespace, strip punctuation except hyphens
    s = re.sub(r'[^ws-]', ' ', ascii_str.upper())
    s = re.sub(r's+', ' ', s).strip()

    if entity_type == 'ORG':
        # Strip trailing legal suffixes (iteratively; 'Apple Inc Corp' -> 'Apple')
        prev = None
        while prev != s:
            prev = s
            s = LEGAL_SUFFIXES.sub('', s).strip()
    elif entity_type == 'IND':
        s = NAME_PREFIXES.sub('', s)
        s = NAME_SUFFIXES.sub('', s).strip()

    return s

Four-pass entity resolution

After normalization, entity resolution runs four passes in order of decreasing precision. Each pass assigns a match_confidence score and the entity is resolved at the first pass that exceeds the confidence threshold for that pass.

Pass	Method	Threshold	Recall	FP rate
1 — Exact ID	FEC committee ID, EIN, or DUNS exact match	1.00	41%	<0.01%
2 — Exact name	Normalized name exact match	1.00	31%	0.3%
3 — TF-IDF cosine	Character 3-gram TF-IDF vectors, cosine ≥ 0.82	0.82	18%	1.2%
4 — Address block	ZIP + street number + normalized name Jaro-Winkler ≥ 0.88	0.88	5%	2.1%

The four passes together resolve 95.5% of entities with a combined false positive rate of 0.8% (weighted by pass volume). The remaining 4.5% of unresolved entities are treated as new entities — an intentionally conservative choice to avoid creating false money-flow links.

TF-IDF cosine matching for committees

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

class FECEntityMatcher:
    """
    TF-IDF cosine matcher for FEC committee and organization names.
    Uses character 3-grams to handle abbreviations and partial matches.
    """
    def __init__(self, threshold: float = 0.82):
        self.threshold = threshold
        self.vectorizer = TfidfVectorizer(
            analyzer='char_wb',
            ngram_range=(3, 3),
            min_df=1,
            sublinear_tf=True,
        )
        self._fitted = False
        self._canonical_names: list[str] = []
        self._canonical_ids: list[str] = []

    def fit(self, canonical_names: list[str], canonical_ids: list[str]) -> None:
        self._canonical_names = canonical_names
        self._canonical_ids = canonical_ids
        self.vectorizer.fit(canonical_names)
        self._matrix = self.vectorizer.transform(canonical_names)
        self._fitted = True

    def match(self, query_name: str, top_k: int = 3) -> list[tuple[str, str, float]]:
        """Return [(canonical_id, canonical_name, score)] sorted descending."""
        if not self._fitted:
            raise RuntimeError('Call fit() first')
        q_vec = self.vectorizer.transform([query_name])
        scores = cosine_similarity(q_vec, self._matrix)[0]
        top_idxs = np.argsort(scores)[::-1][:top_k]
        results = []
        for idx in top_idxs:
            if scores[idx] >= self.threshold:
                results.append((
                    self._canonical_ids[idx],
                    self._canonical_names[idx],
                    float(scores[idx]),
                ))
        return results

LLC chain disambiguation

Super PACs (type O) may accept unlimited contributions from corporations, including LLCs. When an LLC contributes to a Super PAC, the disclosed donor is the LLC — not the individual or corporation behind it. Tracing the beneficial owner requires cross-referencing three external sources:

FinCEN Beneficial Ownership registry (public bulk export since January 2024): maps EIN → reported beneficial owners. Coverage: ~60% of LLCs formed after January 2024; older entities require alternative sources.
SEC EDGAR Exhibit 21 (subsidiary lists in 10-K filings): maps parent corporation → subsidiary LLCs. Coverage: publicly traded parents only (~35% of large-dollar Super PAC donors).
State secretary of state records (scraped for DE, WY, NV, FL — the four states used for ~72% of campaign-finance LLCs): maps LLC → registered agent and sometimes to member names. Coverage: variable; Wyoming and Nevada allow anonymous LLCs with no public member disclosure.

The LLC chain resolution function attempts each source in order and assigns a confidence tier to the result: CONFIRMED (FinCEN registry, Jaro-Winkler ≥ 0.92 on beneficial owner name), PROBABLE (EDGAR Exhibit 21, ≥ 0.85), POSSIBLE (state SOS, ≥ 0.80), or UNRESOLVED. Of 14,200 LLC donors in the 2022 election cycle, 38% were CONFIRMED, 22% PROBABLE, 18% POSSIBLE, and 22% UNRESOLVED— primarily Wyoming and Nevada shell companies with no public member registry.

Output: the FEC entity graph

The resolved entity graph stores three relationship types in a PostgreSQL adjacency table: DONATES_TO (donor → committee, weighted by total contribution amount), PARTICIPANT_OF (committee → JFC, with allocation percentage), and CONTROLLED_BY (LLC → beneficial owner, with confidence tier). Graph traversal from any node to its beneficial owner takes at most 4 hops for 99.2% of cases in our 2022 election cycle dataset.

Money-flow queries run as recursive CTEs in PostgreSQL with a depth limit of 8 hops. The p50 query time for a full money-flow trace from a presidential candidate to all traceable individual beneficial owners is 340ms over 2.4 million records — acceptable for analytical workloads, though we cache the top 200 candidates' full graphs in Redis for interactive use.

For the statistical anomaly detection methods applied to the election data this pipeline feeds: Election statistical anomaly detection: Benford's Law, ARIMA, and turnout modeling across 47 races →

For coordinated campaign detection using the donor graph this pipeline produces: Coordinated campaign detection: identifying inauthentic networks in election data →