Building the OONI Historical Corpus: 1.66M Downloads, Schema Normalization, and the Decisions Behind the Dataset
The OONI historical corpus on HuggingFace has accumulated over 1.66 million downloads since we published it in Q4 2023. Most of those downloads are from ML researchers and data scientists who need a structured, flat CSV they can feed directly into a model — not the raw OONI archive, which requires significant pre-processing before it's usable. This post explains what we did to make that translation possible and why we made the decisions we did.
Why the raw OONI archive is hard to use directly
OONI's measurement archive is one of the largest public datasets on internet censorship in existence — hundreds of millions of measurements going back to 2012. The problem is that the raw archive is stored as one JSON file per measurement, and those JSON files don't follow a consistent schema across probe versions or measurement types.
The top-level structure is relatively stable (report_id, probe_cc, probe_asn, test_name, measurement_start_time), but the test_keys object — which contains the actual measurement result — has a completely different shape for every test type. A Web Connectivity measurement's test_keys look nothing like a DNS Check's or a Vanilla Tor's. And the schema within a single test type has drifted significantly across probe versions released between 2012 and 2024.
# Example: Web Connectivity test_keys (simplified)
{
  "test_keys": {
    "accessible": false,
    "blocking": "dns",
    "dns_consistency": "inconsistent",
    "dns_experiment_failure": null,
    "control_failure": null,
    "http_experiment_failure": "dns_nxdomain_error",
    "requests": [...],     # nested list of HTTP request/response pairs
    "tcp_connect": [...],  # nested connection results
    "queries": [...]       # nested DNS query/response pairs
  }
}

# Example: Tor test_keys (completely different shape)
{
  "test_keys": {
    "tor_log": "...",
    "tor_progress": 100,
    "tor_progress_tag": "done",
    "tor_progress_summary": "...",
    "success": true,
    "targets": {...}       # nested per-bridge measurement
  }
}

For the historical corpus, we focused on the three test types that matter most for censorship detection at scale and have the most consistent schemas: Web Connectivity (HTTP/DNS/TCP blocking tests), DNS Check (resolver tampering), and Vanilla Tor (reachability of the Tor network). These three account for approximately 73% of all OONI measurements.
Schema normalization: what we extract from each test type
For each measurement type, we define a flat extraction schema — a fixed set of columns that can be populated from the test_keys, with null for fields the measurement doesn't provide.
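Dotted paths such as test_keys.accessible need a small resolver that walks the nested record and yields null when any step is missing. The pipeline code later in this post calls helpers named get_nested and extract_universal without showing them. The versions here are a minimal sketch of what they might look like, written as an assumption rather than taken from the published implementation:

```python
from typing import Any

# Copied from the universal column list in this post.
UNIVERSAL_COLUMNS = [
    'report_id', 'probe_cc', 'probe_asn', 'probe_network_name',
    'test_name', 'measurement_start_time', 'test_runtime', 'software_version',
]


def get_nested(record: dict, path: str) -> Any:
    """Follow a dot-separated path into nested dicts.

    Returns None as soon as a step is missing or is not a dict,
    so absent fields become nulls instead of exceptions.
    """
    current: Any = record
    for key in path.split('.'):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def extract_universal(record: dict) -> dict:
    """Start a flat row from the top-level fields shared by all test types."""
    return {col: record.get(col) for col in UNIVERSAL_COLUMNS}


# Hypothetical record, trimmed to the fields used here.
sample = {
    'report_id': 'r1',
    'probe_cc': 'IR',
    'test_name': 'web_connectivity',
    'test_keys': {'accessible': False, 'blocking': 'dns'},
}
print(get_nested(sample, 'test_keys.blocking'))
print(get_nested(sample, 'test_keys.control.dns'))
print(extract_universal(sample))
```

Returning None for a missing step keeps the extraction loop simple: a column map can list every path it might want, and records from older probe versions just produce empty cells.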
# Web Connectivity extraction
WEB_COLUMNS = {
    'accessible': 'test_keys.accessible',
    'blocking_type': 'test_keys.blocking',          # dns | tcp_ip | http-failure | http-diff | false
    'dns_consistent': 'test_keys.dns_consistency',  # consistent | inconsistent | null
    'dns_failure': 'test_keys.dns_experiment_failure',
    'http_failure': 'test_keys.http_experiment_failure',
    'control_failure': 'test_keys.control_failure',
    # Derived: did the IP returned differ from the control IP?
    'ip_mismatch': lambda t: check_ip_mismatch(t['test_keys']),
    # Derived: does the returned IP belong to a known block-page ASN?
    'block_page_asn': lambda t: in_block_page_asn(t['test_keys']),
}

# DNS Check extraction (different test type, different keys)
DNS_COLUMNS = {
    'failure': 'test_keys.lookups.system.failure',
    'resolved_addrs': 'test_keys.lookups.system.addrs',
    'answer_count': lambda t: len(t['test_keys']['lookups']['system'].get('addrs', [])),
    'tls_consistent': 'test_keys.tls_consistency',
}

# Universal columns (present in every test type)
UNIVERSAL_COLUMNS = [
    'report_id', 'probe_cc', 'probe_asn', 'probe_network_name',
    'test_name', 'measurement_start_time', 'test_runtime', 'software_version',
]

Handling probe version schema drift
Between 2012 and 2024, the OONI probe released approximately 140 software versions. The test_keys schema for Web Connectivity alone has gone through 4 major revisions:
- v0.1–0.3 (2012–2016): blocking field absent; accessibility inferred from body_length_match and status_code_match
- v0.4–0.9 (2016–2019): blocking field added but with different null semantics (empty string vs JSON null vs Python None serialized)
- v1.0–2.x (2019–2022): Schema stabilized; accessible boolean added as a derived field
- v3.x+ (2022–present): Restructured for OONI backend v3; adds x_ prefix for experimental fields
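The mixed null semantics in the v0.4–0.9 range are a small example of the drift that normalization has to absorb. A minimal sketch, assuming the three encodings are an empty string, JSON null, and the literal string "None"; normalize_blocking is a hypothetical helper for illustration, not the published code:

```python
from typing import Any, Optional

# Assumption: these are the textual "no value" encodings seen across versions.
_NULL_STRINGS = {'', 'none', 'null'}


def normalize_blocking(value: Any) -> Optional[str]:
    """Collapse the different 'no value' encodings of `blocking` into None.

    False (the probe's "not blocked" verdict) is also mapped to None,
    in line with the published blocking_type column, whose listed values
    are dns, tcp_ip, http-failure, http-diff, or null.
    """
    if value is None or value is False:
        return None
    text = str(value).strip()
    if text.lower() in _NULL_STRINGS:
        return None
    return text


for raw in ['', None, 'None', False, 'dns', ' tcp_ip ']:
    print(repr(raw), '->', repr(normalize_blocking(raw)))
```

Running a normalizer like this before any version-specific logic means later code only has to distinguish "has a blocking type" from "does not".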
We handle this with version-aware extraction functions. Each function tests for the presence of specific keys and falls back through a priority chain:
def extract_blocking_type(record: dict) -> str | None:
    """
    Extract the blocking type from a Web Connectivity record,
    handling schema differences across OONI probe versions.
    """
    # Guard against records where test_keys is missing or null
    tk = record.get('test_keys') or {}

    # v3.x+ format
    # (False falls through to the checks below; empty or null returns None)
    if 'blocking' in tk and tk['blocking'] is not False:
        return str(tk['blocking']) if tk['blocking'] else None

    # v1.0–2.x format: synthesize from component signals
    if 'http_experiment_failure' in tk:
        failure = tk['http_experiment_failure']
        if failure and 'dns' in failure:
            return 'dns'
        if failure and 'connection' in failure:
            return 'tcp_ip'
        if failure:
            return 'http-failure'

    # v0.1–0.3 format: infer from match fields
    if 'body_length_match' in tk:
        if tk.get('body_length_match') is False and tk.get('status_code_match') is False:
            return 'http-diff'

    return None

Streaming 200M+ records without running out of memory
The full OONI archive is stored as a directory tree of gzipped JSON files organized by date, country, and report ID. The files range from a few KB to several MB. The total uncompressed volume is in the hundreds of gigabytes.
Loading even a single month of data into memory to process it as a batch would require more RAM than most workstations have. Instead, we use a generator pipeline that reads one JSON record at a time, applies the extraction schema, and streams the result to a CSV writer:
import csv
import gzip
import json
from pathlib import Path

# extract_universal() and get_nested() are small helpers defined elsewhere in the pipeline.


def iter_records(archive_root: Path, test_name: str):
    """Yield raw records for a specific test type from the OONI archive."""
    for gz_file in sorted(archive_root.rglob(f"*{test_name}*.json.gz")):
        with gzip.open(gz_file, 'rb') as f:
            for line in f:
                try:
                    yield json.loads(line)
                except json.JSONDecodeError:
                    continue  # malformed records: skip


def build_corpus(archive_root: Path, output_csv: Path, test_name: str, columns: dict):
    with open(output_csv, 'w', newline='', encoding='utf-8') as out:
        writer = csv.DictWriter(out, fieldnames=list(UNIVERSAL_COLUMNS) + list(columns))
        writer.writeheader()
        for record in iter_records(archive_root, test_name):
            row = extract_universal(record)
            for col_name, extractor in columns.items():
                try:
                    if callable(extractor):
                        row[col_name] = extractor(record)
                    else:
                        # dot-notation path into nested dict
                        row[col_name] = get_nested(record, extractor)
                except (KeyError, TypeError):
                    row[col_name] = None
            writer.writerow(row)


# Peak memory: ~50MB regardless of archive size
build_corpus(Path('/data/ooni'), Path('web_connectivity.csv'),
             'web_connectivity', WEB_COLUMNS)

What we decided not to include
Three decisions reduced the corpus size significantly while improving usability:
- Drop raw nested objects. The requests, tcp_connect, and queries arrays inside test_keys contain the full HTTP exchange, TCP handshake results, and DNS responses respectively. These are useful for deep-dive analysis but make the CSV unusable for most ML tasks (variable-length nested arrays can't be represented in flat tabular form). We extract derived signals from them (ip_mismatch, block_page_asn, connection_count) and drop the raw arrays.
- Drop annotations and report-level metadata. Each OONI measurement includes reporter annotations, software version strings, engine name, and tunnel type. These are useful for provenance tracking but add substantial column count with low signal value for censorship ML models. Retained: software_version (as a proxy for schema version) and probe_network_name.
- Temporal cutoff at 2022-01-01 for pre-v1 records. Records before 2022 that use the pre-v1 schema have significantly more missing fields after normalization. Rather than include them at reduced quality, we treat the full-schema period (2022–present) as the primary corpus and include pre-2022 records as a separate legacy partition with a documented quality warning.
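The derived signals named in the first decision replace the raw arrays with a single value per record. The post does not show check_ip_mismatch, so the following is only a sketch of one plausible definition. It assumes the probe's answers sit under queries[].answers[] in ipv4/ipv6 fields and the control resolver's addresses under control.dns.addrs; both paths should be verified against the OONI data format documentation.

```python
from typing import Optional


def probe_addrs(test_keys: dict) -> set:
    """Collect every IPv4/IPv6 address from the probe's DNS answers."""
    addrs = set()
    for query in test_keys.get('queries') or []:
        for answer in query.get('answers') or []:
            for field in ('ipv4', 'ipv6'):
                if answer.get(field):
                    addrs.add(answer[field])
    return addrs


def check_ip_mismatch(test_keys: dict) -> Optional[bool]:
    """True if probe and control share no address; None if either side is empty."""
    control = ((test_keys.get('control') or {}).get('dns') or {}).get('addrs') or []
    probe = probe_addrs(test_keys)
    if not probe or not control:
        return None  # nothing to compare: leave the column null
    return probe.isdisjoint(control)


# Hypothetical test_keys fragments for illustration.
mismatched = {
    'queries': [{'answers': [{'answer_type': 'A', 'ipv4': '10.0.0.1'}]}],
    'control': {'dns': {'addrs': ['93.184.216.34']}},
}
matching = {
    'queries': [{'answers': [{'answer_type': 'A', 'ipv4': '93.184.216.34'}]}],
    'control': {'dns': {'addrs': ['93.184.216.34']}},
}
print(check_ip_mismatch(mismatched), check_ip_mismatch(matching))
```

Disjoint address sets also occur legitimately, for example when a CDN hands out different addresses by region, so a flag like this is better treated as one input signal than as a verdict.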
The final schema
The published dataset has 24 columns for Web Connectivity records:
# Core identity
report_id                # OONI report ID (globally unique)
measurement_start_time   # UTC timestamp
probe_cc                 # ISO 3166-1 alpha-2 country code
probe_asn                # AS number (numeric)
probe_network_name       # ISP name as reported by the probe
input                    # URL tested

# Classification
test_name                # web_connectivity | dns_check | tor
accessible               # boolean: was the site reachable?
blocking_type            # dns | tcp_ip | http-failure | http-diff | null

# DNS layer
dns_consistent           # consistent | inconsistent | null
dns_failure              # OONI failure string or null
ip_mismatch              # bool: returned IP differs from control
block_page_asn           # bool: returned IP in known block-page AS

# HTTP layer
http_failure             # OONI failure string or null
status_code              # HTTP response code
body_length_match        # bool: response body length matches control
status_code_match        # bool: response status matches control

# Control comparison
control_failure          # failure reaching the control server (not ISP)

# Derived confidence
anomaly                  # bool: OONI's top-level anomaly flag
confirmed                # bool: OONI confirmed this as a block

# Provenance
software_version         # probe software version (schema quality proxy)
test_runtime             # seconds the measurement took
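Because the corpus is a flat CSV, every value arrives as text, and the boolean and nullable columns need converting on read. A minimal sketch using only the standard library, assuming booleans were written as "True"/"False" and nulls as empty strings (the defaults when Python's csv module writes bool and None values). parse_bool and parse_row are hypothetical helpers, and the actual encoding should be checked against the dataset card.

```python
import csv
import io
from typing import Optional

# Boolean columns taken from the schema listing in this post.
BOOL_COLUMNS = {
    'accessible', 'ip_mismatch', 'block_page_asn',
    'body_length_match', 'status_code_match', 'anomaly', 'confirmed',
}


def parse_bool(text: str) -> Optional[bool]:
    """Map 'True'/'False' (any case) to bool; anything else is missing."""
    lowered = text.strip().lower()
    if lowered in ('true', '1'):
        return True
    if lowered in ('false', '0'):
        return False
    return None


def parse_row(row: dict) -> dict:
    """Convert one csv.DictReader row into typed Python values."""
    typed = {}
    for name, text in row.items():
        text = (text or '').strip()
        if name in BOOL_COLUMNS:
            typed[name] = parse_bool(text)
        elif name == 'status_code':
            typed[name] = int(text) if text.isdigit() else None
        elif name == 'test_runtime':
            try:
                typed[name] = float(text)
            except ValueError:
                typed[name] = None
        else:
            typed[name] = text or None  # empty string means null
    return typed


# Tiny in-memory sample with a subset of the columns.
sample_csv = io.StringIO(
    "report_id,probe_cc,accessible,blocking_type,status_code,test_runtime\n"
    "r1,IR,False,dns,,1.25\n"
)
rows = [parse_row(r) for r in csv.DictReader(sample_csv)]
print(rows[0])
```

Keeping "missing" distinct from False matters for columns such as accessible and ip_mismatch, where a null means the probe could not measure the field and not that the result was negative.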
Coverage and adoption
The published corpus covers January 2022 – October 2024 for Web Connectivity, with 180M+ rows. DNS Check adds 22M rows. Vanilla Tor adds 12M rows. Total compressed CSV size: approximately 18GB.
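At roughly 18GB compressed, the corpus is easier to consume as a stream than to load whole. A minimal sketch of a streaming aggregation, assuming gzip compression and the probe_cc and blocking_type columns from the schema above. The post says only "compressed CSV", so the compression format is an assumption, and count_blocking is a hypothetical helper.

```python
import csv
import gzip
import io
from collections import Counter
from typing import IO, Iterable, Iterator


def iter_rows(stream: IO[str]) -> Iterator[dict]:
    """Yield CSV rows one at a time so memory use stays flat."""
    yield from csv.DictReader(stream)


def count_blocking(rows: Iterable[dict], country: str) -> Counter:
    """Count non-null blocking types for one country code."""
    counts: Counter = Counter()
    for row in rows:
        if row.get('probe_cc') == country and row.get('blocking_type'):
            counts[row['blocking_type']] += 1
    return counts


# In-memory gzip sample standing in for a real corpus file.
raw = (
    "probe_cc,blocking_type\n"
    "IR,dns\n"
    "IR,dns\n"
    "IR,\n"
    "RU,tcp_ip\n"
)
buffer = io.BytesIO(gzip.compress(raw.encode('utf-8')))

# gzip.open accepts a path or a file object; 'rt' decodes to text for csv.
with gzip.open(buffer, 'rt', encoding='utf-8', newline='') as stream:
    result = count_blocking(iter_rows(stream), 'IR')
print(result)
```

For a real file, the BytesIO buffer would be replaced by the path to the downloaded CSV, and the same loop would run over all rows without holding more than one in memory.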
At 1.66M downloads, the corpus is used in academic papers on censorship measurement, in journalism workflows for tracking ISP-level blocking events, and as training data for ML classifiers — including our own anomaly classifier, which uses a labeled subset as positive examples for the DNS tampering, TLS interference, and HTTP blocking classes.
For how the Voidly probe generates the complementary measurement dataset: The Voidly Probe: Tauri + boringtun network measurement at the operator's edge →
For how OONI data is used alongside CensoredPlanet and IODA for cross-source verification: Cross-source censorship verification: reconciling OONI, CensoredPlanet, and IODA →
For how this corpus is turned into a labeled training dataset using weak supervision and label functions: Voidly's ML training pipeline: building a labeled censorship dataset from OONI measurements →
For the ML classifier that uses this corpus as labeled training data: The Voidly anomaly classifier: five interference classes and why we optimize for recall →
For the schema of the live Voidly measurement dataset that this corpus complements: The Voidly measurement dataset: field-by-field schema reference →
For how both this corpus and the global-censorship-index are hosted on HuggingFace — Parquet access patterns, daily updates, and filter recipes: The Voidly open datasets on HuggingFace: structure, daily snapshots, and filter recipes →
For how OONI measurement entity data (ISPs, domains, government agencies) connects to our OSINT entity profiling infrastructure for censorship attribution: Building a digital-footprint reconnaissance pipeline for OSINT investigations →
For the quality_filter() function that gates measurements from this corpus before they reach ML feature extraction — probe version checks, control_failure handling, and the to_feature_input() transformation: Voidly measurement quality filtering: gating probe data before ML feature extraction →
Censorship attribution via OSINT takes the OONI corpus data further, combining it with procurement records and DPI fingerprints to attribute blocking to specific vendors and infrastructure operators.
OONI data normalization covers the schema version detection, anomaly bitmasks, and 95.3% pass-through logic used when preparing the historical corpus for ML training.