Technical writing

The Employment-Immigration Pipeline: From Labor Certification to H-1B Approval

· 13 min read· AI Analytics
ImmigrationH-1BDOLUSCISData Engineering

When a US employer hires a foreign software engineer or research scientist, the paperwork does not live in one place. It is split across two agencies and three federal datasets—the Department of Labor's certification that the wage is fair, the USCIS record of whether the petition was approved, and the Bureau of Labor Statistics survey that sets the wage floor in the first place. Each carries the employer name and the occupation code, so the whole pipeline—from labor certification to H-1B approval—can be reassembled into a single view of who hires whom, at what wage, and how often the application becomes an actual worker.

This article is a data-engineering guide rather than a tour of a single table. It covers the two-agency process the data mirrors—the Department of Labor sign-off that must come before USCIS will adjudicate the petition; the three datasets that make the pipeline tractable and the table each lands in; the two join keys, the employer and the SOC occupation code, and why normalizing employer names across the three systems is the central obstacle; the prevailing wage and how it is benchmarked to the BLS Occupational Employment and Wage Statistics; the H-1B numerical cap and its lottery, which open the gap between applications and approvals; the analytical questions the assembled data answers, from employer conversion rates to whether sponsored wages sit above the local floor; a Python workflow that pulls one employer's labor applications and approved petitions and compares the offered wages to the benchmark; and the caveats that come from joining three datasets that were never designed to be joined.

The two-agency pipeline the data mirrors

Employment-based immigration in the United States is, at heart, a two-step, two-agency process, and the structure of the data follows the structure of the process exactly. Before a US employer can petition for most work visas, it must first obtain Department of Labor sign-off on the labor-market question: does hiring this foreign worker harm the wages or working conditions of US workers similarly employed? For an H-1B, that sign-off is a Labor Condition Application (LCA), filed with the Office of Foreign Labor Certification; for an employment-based green card, it is a full PERM labor certification. In both, the employer attests that it will pay at least the prevailing wage and will not displace US workers.

Only after the Department of Labor step clears does the employer file the petition that USCIS—U.S. Citizenship and Immigration Services, inside the Department of Homeland Security—actually adjudicates: the I-129 for an H-1B, the I-140 for an employment-based immigrant. The Department of Labor owns the wage-and-labor question; USCIS owns the immigration question of whether the employer, the job, and the worker are real and the petition approvable. The prevailing wage that anchors the first step is itself benchmarked to the third dataset—the BLS Occupational Employment and Wage Statistics (OEWS)survey—so the wage floor an employer attests to is not invented case by case but drawn from a national wage survey. The pipeline, in other words, is a chain: BLS sets the floor, DOL certifies the application against it, USCIS decides the petition. The three datasets are the three links.

Three datasets, three tables

The work of assembling the pipeline is the work of joining three federal labor-and-immigration datasets, each of which we store as its own table. None requires an API key; all three are public bulk downloads. The point of pre-loading them as tables is that the analyst's job becomes normalizing employer names and aligning SOC codes across the three—not parsing three idiosyncratic source formats from scratch every time.

The first is the DOL OFLC disclosures, stored as dol_oflc: the quarterly performance-data files the Office of Foreign Labor Certification publishes on dol.gov, with one row per labor case—the LCA and PERM filings (and the temporary H-2A and H-2B programs). Each row carries the employer, the SOC occupation, the worksite, the offered and prevailing wage, and the case status. The second is the USCIS H-1B records, stored as uscis_h1b: drawn from the USCIS H-1B Employer Data Hub, which reports, by employer and fiscal year, the counts of approved and denied H-1B petitions—the adjudication outcome the disclosure data cannot see. The third is the BLS OEWS wage data, stored as bls_oews: the Occupational Employment and Wage Statistics, which estimate employment and wage percentiles for every SOC occupation nationally and by metropolitan area—the benchmark against which both the prevailing wage is set and the offered wage can be judged. The three tables share the columns that make the join possible:

-- dol_oflc  (labor side: one row per LCA / PERM application)
employer_name        -- the attesting / petitioning employer
soc_code             -- Standard Occupational Classification of the job
worksite_state       -- state of the place of employment
wage_offer           -- the wage the employer offers to pay
prevailing_wage      -- the DOL-determined floor for the job + area
case_status          -- certified, denied, withdrawn

-- uscis_h1b  (adjudication side: one row per employer-year)
employer_name        -- the petitioning employer (Employer Data Hub)
fiscal_year          -- USCIS fiscal year of the petitions
initial_approvals    -- new H-1B petitions approved
initial_denials      -- new H-1B petitions denied

-- bls_oews  (wage benchmark: one row per SOC, national or by metro)
occ_code             -- the SOC occupation code (join key to soc_code)
area_title           -- nation or metropolitan statistical area
a_median             -- annual median wage for the occupation + area
pct10 / pct25 / pct75 / pct90  -- wage percentiles (the level structure)

Read down those three blocks and the architecture of the pipeline is visible in the columns. The employer_name appears in the labor table and the adjudication table, tying a firm's applications to its approvals. The soc_code in the labor table is the same taxonomy as occ_code in the wage table, tying a sponsored job to the wage distribution for that occupation. And the worksite geography in the labor table lines up with the area_title in the wage table, because the prevailing wage is set for an occupation in an area, not nationally. The three keys—employer, occupation, geography—are exactly the dimensions along which the high-skilled-immigration debate is conducted, which is why the assembled data, rather than any one source, is what answers the questions at its center.

The join keys: employer and occupation

Two keys carry the join, and they behave very differently. The SOC code is the easy one. The Standard Occupational Classification is a stable federal taxonomy of occupations, and because the Department of Labor codes every labor application to a SOC and the BLS publishes OEWS estimates by SOC, the two line up cleanly—subject only to the periodic revisions of the SOC system itself (the 2010 and 2018 vintages differ), which a multi-year analysis must crosswalk. Align the SOC vintages and the labor-to-wage join is essentially mechanical: a sponsored job in a given SOC can be placed directly against the wage distribution for that SOC.

The employer name is the hard one, and it is the usual obstacle in any serious pipeline analysis. The three systems were built independently and identify employers by self-reported name, which varies in spelling, punctuation, abbreviation, and corporate form across filings and across the two agencies: the same firm appears as “Google LLC,” “GOOGLE INC,” and “Google Inc.” in different rows, and a single corporate parent often files under many distinct subsidiary and doing-business-as names. Neither the OFLC disclosures nor the USCIS Employer Data Hub publishes a clean, shared entity identifier, so the analyst must build one: at minimum, normalizing case, stripping punctuation and corporate suffixes, and collapsing obvious variants; at best, resolving names to a stable key such as an EIN where available or a resolved entity from another dataset. A naive join on the raw name field will fragment a large sponsor into several smaller ones and silently undercount the conversion from application to approval. The quality of an employer-level pipeline analysis is, in practice, mostly the quality of this name-normalization step.

The prevailing wage and the BLS benchmark

The reason the BLS wage data belongs in the pipeline at all is that the prevailing wage is benchmarked to it. The prevailing-wage attestation is the core worker-protection mechanism of the whole system: by requiring that the foreign worker be paid at least what the job pays in its local labor market, the program is meant to remove the wage incentive to substitute foreign labor for domestic labor. But “what the job pays in its local labor market” is not a judgment call left to each case—it is, for most determinations, read off the Occupational Employment and Wage Statistics survey for the relevant SOC occupation and area.

The OEWS does not publish a single “prevailing wage”; it publishes a distribution—a median and a set of percentiles for each occupation in each area. The foreign-labor-certification system maps that distribution onto four wage levels: Level I (entry, near the 17th percentile), Level II (qualified), Level III (experienced), and Level IV (fully competent, near the 67th percentile). The employer selects a level based on the job's requirements, and that choice sets the floor. This is exactly why all three datasets are needed to answer the central wage question. The DOL disclosures tell you the offered wage and the prevailing wage the employer used; the BLS OEWS tells you the full wage distribution for that occupation and area; and only by placing the offered wage against the OEWS percentiles can an analyst test whether a sponsored salary sits at the bottom or the middle of the local market—whether, for instance, a large share of experienced specialty-occupation roles is being certified at the entry-level Level I floor. The benchmark turns the bare attestation into a measurable claim.

The H-1B cap and the application-to-approval gap

The most important reason the labor data and the USCIS data tell different stories—and therefore must be joined rather than substituted for one another—is the H-1B numerical cap and its lottery. The H-1B program is statutorily limited each fiscal year (with a regular cap and an additional allotment for holders of US advanced degrees, and exemptions for universities and nonprofit and government research organizations). When registrations exceed the cap, as they routinely do by a large multiple, USCIS conducts a random selection—the lottery—to decide which employers may even file a petition. Many certified labor condition applications therefore never become petitions at all, because the registration was not selected.

This is the structural origin of the application-to-approval gap, and it is the single most valuable thing the joined data reveals. A certified LCA is a necessary but far-from-sufficient condition for an actual H-1B hire: between the certified application and the working employee stand the lottery and a separate USCIS adjudication that can still deny the petition. Counting certified LCAs as “H-1B workers” can overstate the workforce by a wide margin—which is precisely why the USCIS H-1B records exist in the analysis. By joining an employer's labor applications (the dol_oflc side) to its approved and denied petitions (the uscis_h1b side), an analyst can measure the conversion rate: how many of a firm's applications actually became approvals. Employers with thousands of LCAs but a far smaller count of approved petitions look very different once the lottery and adjudication funnel is made visible—and the gap itself, varying by employer and year, is a direct measurement of how the cap shapes who actually gets hired.

What the assembled pipeline answers

Assembled, the three datasets answer the questions at the center of the high-skilled-immigration debate—questions that no single source can settle on its own. They are worth stating plainly, because each maps onto a specific join.

Which employers file the most, and how many convert.Tallying labor applications by normalized employer and joining to the USCIS approval counts reveals both the concentration of the program—a relatively small number of large technology firms and IT-services and staffing companies file a large share of the applications—and the very different picture that emerges once the lottery and adjudication gap is applied. The application ranking and the approval ranking are not the same ranking, and the difference is the analysis. Whether sponsored wages clear the local floor.Joining offered wages by SOC and area to the BLS OEWS distribution tests, across the whole population of cases rather than by anecdote, whether sponsored pay sits at or above the prevailing wage and where in the local distribution it falls.

Which occupations and metros concentrate demand.Grouping by SOC and worksite geography—and normalizing by total OEWS employment in the occupation and area—maps where the country reaches hardest for foreign skilled labor relative to its existing workforce, distinguishing genuine concentration from the mere size of an occupation. And how the cap shapes the gapbetween applications and approvals is, as the previous section argued, the throughline: it is only by holding the labor applications and the adjudicated petitions side by side that the effect of the numerical limit on real hiring becomes measurable rather than assumed. In every one of these, the value is in the join: the debate is about employers, wages, occupations, and the cap, and those live in different datasets that only become an argument when keyed together.

Python workflow: tracing one employer through all three sources

The script below traces a single employer through the pipeline. It pulls the firm's labor applications from the OFLC disclosure file, its approved and denied H-1B petitions from the USCIS Employer Data Hub, and the BLS OEWS wage benchmark for each SOC occupation, then computes two of the core metrics: the application-to-approval conversion (approved petitions as a share of approved plus denied) and, for each occupation the employer sponsors, whether the median offered wage sits at or above the BLS median for that SOC. No API key is required—all three are public bulk downloads. The load-bearing helper is norm_employer, which performs the minimal employer-name normalization the join depends on; a production pipeline would replace it with real entity resolution.

import requests, io
import pandas as pd

# Tracing one employer through the employment-immigration pipeline.
#
# Three public, key-free federal sources are joined on the employer name
# and the SOC occupation code:
#   1. DOL OFLC disclosure files  -- the labor side (LCA / PERM applications)
#   2. USCIS H-1B Employer Data Hub -- approved vs. denied petitions
#   3. BLS OEWS wage data         -- the prevailing-wage benchmark by SOC + area
#
# Filenames and the fiscal-year path change each release, so isolate the URLs
# here and confirm them against the current dol.gov / uscis.gov / bls.gov pages.
LCA_URL  = "https://www.dol.gov/sites/dolgov/files/ETA/oflc/pdfs/LCA_Disclosure_Data_FY.xlsx"
HUB_URL  = "https://www.uscis.gov/sites/default/files/document/data/Employer_Data_Hub.csv"
OEWS_URL = "https://www.bls.gov/oes/special-requests/oesm_national.zip"


def _col(df, *needles):
    # Headers are renamed between releases; match the first column whose
    # name contains all of the needles (case-insensitive).
    for c in df.columns:
        u = str(c).upper()
        if all(n.upper() in u for n in needles):
            return c
    return None


def norm_employer(s):
    # The hard part: line up "GOOGLE LLC", "Google Inc.", "GOOGLE INC"
    # across three systems. Strip case, punctuation, and corporate suffixes.
    s = str(s).upper()
    for suf in (" INC", " LLC", " LP", " CORP", " CO", " LTD", " PLC", "."):
        s = s.replace(suf, "")
    return " ".join(s.split())


def labor_apps(employer):
    r = requests.get(LCA_URL, timeout=600); r.raise_for_status()
    df = pd.read_excel(io.BytesIO(r.content))
    emp, soc = _col(df, "EMPLOYER", "NAME"), _col(df, "SOC", "CODE")
    df["_emp"] = df[emp].map(norm_employer)
    return df[df["_emp"] == norm_employer(employer)], soc


def petitions(employer):
    r = requests.get(HUB_URL, timeout=300); r.raise_for_status()
    df = pd.read_csv(io.BytesIO(r.content))
    emp = _col(df, "Employer") or _col(df, "EMPLOYER")
    df["_emp"] = df[emp].map(norm_employer)
    return df[df["_emp"] == norm_employer(employer)]


def analyze(employer):
    apps, soc = labor_apps(employer)
    pet = petitions(employer)
    print(f"{employer}: {len(apps):,} labor applications on file")

    # --- Applications vs. approved petitions (the conversion gap) -------
    appr = _col(pet, "Initial", "Approval") or _col(pet, "Approval")
    deny = _col(pet, "Initial", "Denial") or _col(pet, "Denial")
    a = pd.to_numeric(pet[appr], errors="coerce").sum() if appr else 0
    d = pd.to_numeric(pet[deny], errors="coerce").sum() if deny else 0
    rate = a / max(a + d, 1)
    print(f"  USCIS petitions: {int(a):,} approved / {int(d):,} denied "
          f"({rate:.1%} approval)")

    # --- Offered wage vs. BLS prevailing benchmark by SOC --------------
    oews = pd.read_csv(OEWS_URL)  # national OEWS estimates, one row per SOC
    o_soc = _col(oews, "OCC", "CODE")
    o_med = _col(oews, "A_MEDIAN") or _col(oews, "MEDIAN")
    bench = oews.set_index(o_soc)[o_med].to_dict()
    wage = _col(apps, "WAGE", "RATE") or _col(apps, "WAGE")
    for code in apps[soc].dropna().unique()[:10]:
        offered = pd.to_numeric(apps[apps[soc] == code][wage],
                                errors="coerce").median()
        floor = pd.to_numeric(bench.get(code), errors="coerce")
        flag = "below" if offered < floor else "at/above"
        print(f"  {code}: offered ${offered:,.0f} vs BLS ${floor:,.0f} ({flag})")
    return apps, pet


analyze("GOOGLE LLC")

Three practical notes apply. First, the employer normalization in norm_employeris deliberately crude—it strips suffixes and punctuation but will not catch a parent filing under unrelated subsidiary names, nor will it merge a misspelling; serious work needs a fuzzy-matching or EIN-based resolution step, and the conversion rate is only as trustworthy as that step. Second, the wage comparison against the BLS national median is a coarse first pass: the prevailing wage is set by occupation and area, so a rigorous version must join the metropolitan OEWS estimates to the worksite geography, normalize the offered wage to an annual figure, restrict to certified cases, and compare against the matching percentile for the employer's selected wage level rather than a bare median. Third, the USCIS Employer Data Hub reports petition counts by employer and year but does not carry the SOC or wage detail; that detail lives only on the labor side, which is the entire reason the join is necessary. For national-scale work across all employers and many years, downloading the full files and building the entity-resolution layer once is far more efficient than per-employer scraping—and each source ships a record-layout document that supplies the authoritative, version-stamped column definitions for its release.

Limitations and analytical caveats

The assembled pipeline is the most complete public picture of employment-based immigration available, but joining three datasets that were never designed to be joined introduces limitations beyond those of any single source, and an analyst must internalize them before drawing conclusions.

Employer-name normalization is the dominant source of error. Because there is no shared entity key across the OFLC disclosures and the USCIS Employer Data Hub, every employer-level metric—application counts, approval counts, and the conversion rate that joins them—is hostage to the quality of name resolution. Under-matching fragments a sponsor and understates its activity; over-matching merges distinct firms and overstates it. The conversion rate in particular, which divides one agency's counts by another's, compounds whatever mismatch exists on each side. This is the first thing to validate and the last thing to trust in any cross-dataset claim.

A certified application is not an admitted worker, and the grains differ. The OFLC disclosure record is an application record at the grain of one row per case; the USCIS Employer Data Hub is an outcome record at the grain of one row per employer-year. They do not align one-to-one: a single LCA can cover multiple positions, the lottery removes many certified applications before they ever become petitions, and the petition counts mix new and continuing employment in ways the labor file does not. Any headcount that flows from the labor side alone should be read as a ceiling, and the join to USCIS is what brings it back toward reality—but the join relates counts, not individual cases, so it supports rates and ratios more confidently than exact person-level counts.

The wage benchmark is an approximation, and the data is attestation. The BLS OEWS join places offered wages against a survey distribution, but the actual prevailing-wage determination depends on the selected wage level, the precise area definition, and occasionally an employer-provided alternate wage source, so the OEWS median is a benchmark, not the exact floor the Department of Labor applied. Compounding this, the offered wage on the labor side is what the employer attested it would pay, certified for completeness rather than audited for truth—it is not verified realized compensation. Treating the offered wage as actual pay, or the OEWS median as the literal legal floor, over-reads both datasets.

Schemas, SOC vintages, and program rules all drift.The disclosure-file columns are renamed between releases, the USCIS Data Hub layout has changed, the SOC system has been revised (2010 versus 2018) requiring a crosswalk, and the wage-level methodology and H-1B cap rules have been adjusted repeatedly. A multi-year pipeline therefore demands careful harmonization on every axis, and apparent trends can reflect coding or rule changes rather than real shifts in employer behavior. Held with these caveats in mind, the joined dol_oflc, uscis_h1b, and bls_oews tables are a uniquely powerful resource: the only way to see the employment-immigration pipeline whole—the labor-market gate, the adjudicated outcome, and the wage benchmark that ties them—in a single view of who a US employer hires from abroad, for what work, at what pay, and how often the application becomes a worker.

Related writing

DOL OFLC Disclosures: The Federal Record of Employer Visa-Labor Applications — The labor side of the pipeline in full detail: the LCA and PERM filings, the prevailing-wage attestation, and the case-level disclosure files that supply the offered wage, the SOC occupation, and the employer this guide joins to the USCIS and BLS data.

USCIS H-1B Data: The Federal Record of Who Sponsors Skilled Foreign Workers — The adjudication side of the pipeline: the Employer Data Hub's approved and denied petition counts that, joined to the labor applications by employer, reveal the application-to-approval conversion the cap and lottery open up.

BLS OEWS: The Occupational Employment and Wage Statistics Behind Every Salary Benchmark — The wage benchmark that sets the prevailing-wage floor: the OEWS distribution by SOC and area, against which the offered wages on the labor side are measured to test whether sponsored pay clears the local market.