Building the OONI Historical Corpus: 1.66M Downloads, Schema Normalization, and the Decisions Behind the Dataset

8 min read · AI Analytics
Censorship · OONI · Data engineering · HuggingFace

The OONI historical corpus on HuggingFace has accumulated over 1.66 million downloads since we published it in Q4 2023. Most of those downloads are from ML researchers and data scientists who need a structured, flat CSV they can feed directly into a model — not the raw OONI archive, which requires significant pre-processing before it's usable. This post explains what we did to make that translation possible and why we made the decisions we did.

Why the raw OONI archive is hard to use directly

OONI's measurement archive is one of the largest public datasets on internet censorship in existence — hundreds of millions of measurements going back to 2012. The problem is that the raw archive is stored as one JSON file per measurement, and those JSON files don't follow a consistent schema across probe versions or measurement types.

The top-level structure is relatively stable (report_id, probe_cc, probe_asn, test_name, measurement_start_time), but the test_keys object — which contains the actual measurement result — has a completely different shape for every test type. A Web Connectivity measurement's test_keys look nothing like a DNS Check's or a Vanilla Tor's. And the schema within a single test type has drifted significantly across probe versions released between 2012 and 2024.

# Example: Web Connectivity test_keys (simplified)
{
  "test_keys": {
    "accessible": false,
    "blocking": "dns",
    "dns_consistency": "inconsistent",
    "dns_experiment_failure": null,
    "control_failure": null,
    "http_experiment_failure": "dns_nxdomain_error",
    "requests": [...],   # nested list of HTTP request/response pairs
    "tcp_connect": [...], # nested connection results
    "queries": [...]      # nested DNS query/response pairs
  }
}

# Example: Tor test_keys (completely different shape)
{
  "test_keys": {
    "tor_log": "...",
    "tor_progress": 100,
    "tor_progress_tag": "done",
    "tor_progress_summary": "...",
    "success": true,
    "targets": {...}  # nested per-bridge measurement
  }
}

For the historical corpus, we focused on the three test types that matter most for censorship detection at scale and have the most consistent schemas: Web Connectivity (HTTP/DNS/TCP blocking tests), DNS Check (resolver tampering), and Vanilla Tor (reachability of the Tor network). These three account for approximately 73% of all OONI measurements.

Schema normalization: what we extract from each test type

For each measurement type, we define a flat extraction schema — a fixed set of columns that can be populated from the test_keys, with null for fields the measurement doesn't provide.

# Web Connectivity extraction
WEB_COLUMNS = {
    'accessible': 'test_keys.accessible',
    'blocking_type': 'test_keys.blocking',           # dns | tcp_ip | http-failure | http-diff | false
    'dns_consistent': 'test_keys.dns_consistency',   # consistent | inconsistent | null
    'dns_failure': 'test_keys.dns_experiment_failure',
    'http_failure': 'test_keys.http_experiment_failure',
    'control_failure': 'test_keys.control_failure',
    # Derived: did the IP returned differ from the control IP?
    'ip_mismatch': lambda t: check_ip_mismatch(t['test_keys']),
    # Derived: does the returned IP belong to a known block-page ASN?
    'block_page_asn': lambda t: in_block_page_asn(t['test_keys']),
}

# DNS Check extraction (different test type, different keys)
DNS_COLUMNS = {
    'failure': 'test_keys.lookups.system.failure',
    'resolved_addrs': 'test_keys.lookups.system.addrs',
    'answer_count': lambda t: len(t['test_keys']['lookups']['system'].get('addrs', [])),
    'tls_consistent': 'test_keys.tls_consistency',
}

# Universal columns (present in every test type)
UNIVERSAL_COLUMNS = [
    'report_id', 'probe_cc', 'probe_asn', 'probe_network_name',
    'test_name', 'measurement_start_time', 'test_runtime', 'software_version',
]
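
The two derived Web Connectivity columns call helpers that are not shown in this post. As a rough illustration only, here is one way check_ip_mismatch could be written. It assumes the probe's DNS answers live under test_keys.queries[].answers[] (in ipv4 / ipv6 fields) and the control's answers under test_keys.control.dns.addrs. Those field paths and the probe_ips / control_ips helper names are assumptions made for this sketch, not a description of the production code.

```python
from typing import Optional


def probe_ips(test_keys: dict) -> set:
    """Collect every address the probe's resolver returned (assumed layout)."""
    ips = set()
    for query in test_keys.get("queries") or []:
        for answer in query.get("answers") or []:
            ip = answer.get("ipv4") or answer.get("ipv6")
            if ip:
                ips.add(ip)
    return ips


def control_ips(test_keys: dict) -> set:
    """Collect the addresses the control vantage point resolved (assumed layout)."""
    control = test_keys.get("control") or {}
    dns = control.get("dns") or {}
    return set(dns.get("addrs") or [])


def check_ip_mismatch(test_keys: dict) -> Optional[bool]:
    """True if probe and control share no address, None if either side is empty."""
    probe = probe_ips(test_keys)
    control = control_ips(test_keys)
    if not probe or not control:
        return None  # nothing to compare, so leave the column null
    return probe.isdisjoint(control)
```

The sketch compares by overlap rather than strict equality because CDN-hosted sites routinely return different subsets of the same address pool to different resolvers. Requiring identical sets would flag many benign measurements.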

Handling probe version schema drift

Between 2012 and 2024, the OONI probe released approximately 140 software versions. The test_keys schema for Web Connectivity alone has gone through 4 major revisions:

  • v0.1–0.3 (2012–2016): blocking field absent; accessibility inferred from body_length_match and status_code_match
  • v0.4–0.9 (2016–2019): blocking field added but with different null semantics (empty string vs JSON null vs Python None serialized)
  • v1.0–2.x (2019–2022): Schema stabilized; accessible boolean added as a derived field
  • v3.x+ (2022–present): Restructured for OONI backend v3; adds x_ prefix for experimental fields
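
To make the revision list concrete, the sketch below buckets a software_version string into one of the four eras and collapses the v0.4–0.9 null spellings into a single None. The era labels, the function names, and the choice to treat only "" and "None" as null spellings are assumptions for illustration. The boundaries are copied from the list above.

```python
import re
from typing import Any, Optional

# String spellings of "no value" described for the v0.4-0.9 era (assumed set).
_NULL_SPELLINGS = {"", "None"}


def schema_era(software_version: Optional[str]) -> Optional[str]:
    """Map a probe version string such as '3.19.1' to a schema era label."""
    match = re.match(r"v?(\d+)\.(\d+)", software_version or "")
    if not match:
        return None  # unparseable or missing version
    major, minor = int(match.group(1)), int(match.group(2))
    if major >= 3:
        return "v3+"
    if major >= 1:
        return "v1-2"
    if minor >= 4:
        return "v0.4-0.9"
    return "v0.1-0.3"


def normalize_null(value: Any) -> Any:
    """Collapse '', 'None' and JSON null into None; leave real values alone."""
    if value is None:
        return None
    if isinstance(value, str) and value.strip() in _NULL_SPELLINGS:
        return None
    return value  # note: a boolean False is a real value and is preserved
```

Keeping False distinct from null matters for the blocking field, where False means "measured, not blocked" and null means "unknown".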

We handle this with version-aware extraction functions. Each function tests for the presence of specific keys and falls back through a priority chain:

def extract_blocking_type(record: dict) -> str | None:
    """
    Extract the blocking type from a Web Connectivity record,
    handling schema differences across OONI probe versions.
    """
    tk = record.get('test_keys') or {}  # also covers an explicit null test_keys

    # v3.x+ format
    if 'blocking' in tk and tk['blocking'] is not False:
        return str(tk['blocking']) if tk['blocking'] else None

    # v1.0–2.x format: synthesize from component signals
    if 'http_experiment_failure' in tk:
        failure = tk['http_experiment_failure']
        if failure and 'dns' in failure:
            return 'dns'
        if failure and 'connection' in failure:
            return 'tcp_ip'
        if failure:
            return 'http-failure'

    # v0.1–0.3 format: infer from match fields
    if 'body_length_match' in tk:
        if tk.get('body_length_match') is False and tk.get('status_code_match') is False:
            return 'http-diff'

    return None

Streaming 200M+ records without running out of memory

The full OONI archive is stored as a directory tree of gzipped JSON files organized by date, country, and report ID. The files range from a few KB to several MB. The total uncompressed volume is in the hundreds of gigabytes.

Loading even a single month of data into memory to process it as a batch would require more RAM than most workstations have. Instead, we use a generator pipeline that reads one JSON record at a time, applies the extraction schema, and streams the result to a CSV writer:

import gzip, json, csv
from pathlib import Path

def iter_records(archive_root: Path, test_name: str):
    """Yield raw records for a specific test type from the OONI archive."""
    for gz_file in sorted(archive_root.rglob(f"*{test_name}*.json.gz")):
        with gzip.open(gz_file, 'rb') as f:
            for line in f:
                try:
                    yield json.loads(line)
                except json.JSONDecodeError:
                    continue  # malformed records: skip

def build_corpus(archive_root: Path, output_csv: Path, test_name: str, columns: dict):
    with open(output_csv, 'w', newline='', encoding='utf-8') as out:
        writer = csv.DictWriter(out, fieldnames=list(UNIVERSAL_COLUMNS) + list(columns))
        writer.writeheader()

        for record in iter_records(archive_root, test_name):
            row = extract_universal(record)
            for col_name, extractor in columns.items():
                try:
                    if callable(extractor):
                        row[col_name] = extractor(record)
                    else:
                        # dot-notation path into nested dict
                        row[col_name] = get_nested(record, extractor)
                except (KeyError, TypeError):
                    row[col_name] = None
            writer.writerow(row)

# Peak memory: ~50MB regardless of archive size
build_corpus(Path('/data/ooni'), Path('web_connectivity.csv'),
             'web_connectivity', WEB_COLUMNS)
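
build_corpus relies on two helpers, extract_universal and get_nested, that are not defined in the listing above. A minimal sketch of what they might look like follows. The bodies are assumptions reconstructed from how the helpers are called, not the original implementation.

```python
from typing import Any

# Copied from the UNIVERSAL_COLUMNS listing earlier in the post.
UNIVERSAL_COLUMNS = [
    "report_id", "probe_cc", "probe_asn", "probe_network_name",
    "test_name", "measurement_start_time", "test_runtime", "software_version",
]


def get_nested(record: dict, path: str) -> Any:
    """Follow a dot-notation path such as 'test_keys.blocking' into a dict.

    Returns None as soon as a segment is missing or a non-dict is reached,
    so older records that lack a field yield a null column.
    """
    current: Any = record
    for key in path.split("."):
        if not isinstance(current, dict):
            return None
        current = current.get(key)
    return current


def extract_universal(record: dict) -> dict:
    """Build the starting row from the top-level fields shared by all tests."""
    return {name: record.get(name) for name in UNIVERSAL_COLUMNS}
```

Because csv.DictWriter writes None as an empty string, null fields end up as empty cells in the output CSV.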

What we decided not to include

Three decisions reduced the corpus size significantly while improving usability:

  • Drop raw nested objects. The requests, tcp_connect, and queries arrays inside test_keys contain the full HTTP exchange, TCP handshake results, and DNS responses respectively. These are useful for deep-dive analysis but make the CSV unusable for most ML tasks (variable-length nested arrays can't be represented in flat tabular form). We extract derived signals from them (ip_mismatch, block_page_asn, connection_count) and drop the raw arrays.
  • Drop annotations and report-level metadata. Each OONI measurement includes reporter annotations, software version strings, engine name, and tunnel type. These are useful for provenance tracking but add substantial column count with low signal value for censorship ML models. Retained: software_version (as a proxy for schema version) and probe_network_name.
  • Temporal cutoff at 2022-01-01 for pre-v1 records. Records before 2022 that use the pre-v1 schema have significantly more missing fields after normalization. Rather than include them at reduced quality, we treat the full-schema period (2022–present) as the primary corpus and include pre-2022 records as a separate legacy partition with a documented quality warning.
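
The temporal cutoff can be expressed as a small routing function. This is a sketch under two assumptions: measurement_start_time uses the "YYYY-MM-DD HH:MM:SS" format, and the partitions are called "primary" and "legacy" (both names are hypothetical).

```python
from datetime import datetime

# Cutoff date taken from the decision described above.
CUTOFF = datetime(2022, 1, 1)


def parse_start_time(value: str) -> datetime:
    """Parse a timestamp like '2022-03-01 12:34:56' (assumed format)."""
    return datetime.strptime(value, "%Y-%m-%d %H:%M:%S")


def partition_for(record: dict) -> str:
    """Route a record to the legacy or primary partition by start time."""
    started = parse_start_time(record["measurement_start_time"])
    return "legacy" if started < CUTOFF else "primary"
```

A writer built on this would keep one output file per partition name and pick the file with partition_for(record) before writing each row.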

The final schema

The published dataset has 24 columns for Web Connectivity records; the listing below covers 22 of them:

# Core identity
report_id           # OONI report ID (globally unique)
measurement_start_time  # UTC timestamp
probe_cc            # ISO 3166-1 alpha-2 country code
probe_asn           # AS number (numeric)
probe_network_name  # ISP name as reported by the probe
input               # URL tested

# Classification
test_name           # web_connectivity | dns_check | tor
accessible          # boolean: was the site reachable?
blocking_type       # dns | tcp_ip | http-failure | http-diff | null

# DNS layer
dns_consistent      # consistent | inconsistent | null
dns_failure         # OONI failure string or null
ip_mismatch         # bool: returned IP differs from control
block_page_asn      # bool: returned IP in known block-page AS

# HTTP layer
http_failure        # OONI failure string or null
status_code         # HTTP response code
body_length_match   # bool: response body length matches control
status_code_match   # bool: response status matches control

# Control comparison
control_failure     # failure reaching the control server (not ISP)

# Derived confidence
anomaly             # bool: OONI's top-level anomaly flag
confirmed           # bool: OONI confirmed this as a block

# Provenance
software_version    # probe software version (schema quality proxy)
test_runtime        # seconds the measurement took

Coverage and adoption

The published corpus covers January 2022 – October 2024 for Web Connectivity, with 180M+ rows. DNS Check adds 22M rows. Vanilla Tor adds 12M rows. Total compressed CSV size: approximately 18GB.
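
At this size the CSV is best consumed the same way it was produced: one row at a time. The sketch below counts blocking types per country with only the standard library. It assumes null values appear as empty cells, and the function names are illustrative.

```python
import csv
from collections import Counter
from typing import Iterable, TextIO


def blocking_counts(rows: Iterable[dict]) -> Counter:
    """Count (probe_cc, blocking_type) pairs, skipping rows with no blocking."""
    counts: Counter = Counter()
    for row in rows:
        blocking = row.get("blocking_type")
        if blocking:  # an empty cell is treated as null (assumption)
            counts[(row.get("probe_cc"), blocking)] += 1
    return counts


def blocking_counts_from_file(handle: TextIO) -> Counter:
    """Stream an open CSV file through blocking_counts without loading it."""
    return blocking_counts(csv.DictReader(handle))
```

To run it against a local copy, open the file with open(path, newline='', encoding='utf-8') and pass the handle to blocking_counts_from_file. Memory use depends on the number of distinct country and blocking-type pairs, not on the row count.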

At 1.66M downloads, the corpus is used in academic papers on censorship measurement, in journalism workflows for tracking ISP-level blocking events, and as training data for ML classifiers — including our own anomaly classifier, which uses a labeled subset as positive examples for the DNS tampering, TLS interference, and HTTP blocking classes.


For how the Voidly probe generates the complementary measurement dataset: The Voidly Probe: Tauri + boringtun network measurement at the operator's edge →

For how OONI data is used alongside CensoredPlanet and IODA for cross-source verification: Cross-source censorship verification: reconciling OONI, CensoredPlanet, and IODA →

For how this corpus is turned into a labeled training dataset using weak supervision and label functions: Voidly's ML training pipeline: building a labeled censorship dataset from OONI measurements →

For the ML classifier that uses this corpus as labeled training data: The Voidly anomaly classifier: five interference classes and why we optimize for recall →

For the schema of the live Voidly measurement dataset that this corpus complements: The Voidly measurement dataset: field-by-field schema reference →

For how both this corpus and the global-censorship-index are hosted on HuggingFace — Parquet access patterns, daily updates, and filter recipes: The Voidly open datasets on HuggingFace: structure, daily snapshots, and filter recipes →

For how OONI measurement entity data (ISPs, domains, government agencies) connects to our OSINT entity profiling infrastructure for censorship attribution: Building a digital-footprint reconnaissance pipeline for OSINT investigations →

For the quality_filter() function that gates measurements from this corpus before they reach ML feature extraction — probe version checks, control_failure handling, and the to_feature_input() transformation: Voidly measurement quality filtering: gating probe data before ML feature extraction →

Censorship attribution via OSINT takes the OONI corpus data further, combining it with procurement records and DPI fingerprints to attribute blocking to specific vendors and infrastructure operators.

OONI data normalization covers the schema version detection, anomaly bitmasks, and 95.3% pass-through logic used when preparing the historical corpus for ML training.