
The Voidly measurement quality filter: how we clean 200M OONI records before ML training

November 13, 2024 · 9 min read · AI Analytics

Censorship · Voidly · ML · Data engineering

Not every measurement that a Voidly probe uploads is fit for ML training. Before a raw probe result enters the feature-extraction pipeline, it passes through a quality filter — a set of boolean checks that discards measurements too incomplete, too old, or too corrupted to produce reliable training labels. This article covers what those checks are, why they exist, and what the 3.2% drop rate breaks down to across 200M+ raw OONI records.

This article sits between two others in the pipeline narrative: the OONI historical corpus article, which explains how we processed and published the raw OONI archive to HuggingFace, and the ML training data article, which explains how we label that data with Snorkel weak supervision. The quality filter is the bridge: the step that decides which raw measurements are fit to enter the labeling and feature-extraction pipeline at all.

Why quality filtering matters

The anomaly classifier is only as good as its training data. This is not a platitude — it is a statement about specific failure modes that arise when low-quality measurements enter the feature-extraction pipeline without screening.

A measurement with a missing DNS response but no control comparison tells you nothing. The feature extractor will impute the missing DNS fields with sentinel values, and the model will train on a record where the “DNS comparison” features are sentinel-vs-sentinel — a meaningless signal that the model may nonetheless learn to pattern-match on. A measurement from a probe running an old schema version (pre-2.5.0) uses different field names for TCP timing, meaning the extractor either silently produces null for fields that are actually populated or reads the wrong field entirely. A duplicate of a measurement that already passed is wasted compute at best; at worst, it inflates the apparent count of an event during the data-contamination audit that runs before each training cycle.

Each of these cases corrupts training labels if allowed through. The quality filter catches them at the Kafka consumer boundary, before the feature extractor ever sees the measurement.

The quality filter function

The filter is a single Python function that takes a raw measurement dict and returns a FilterResult — either a pass or a named drop reason. Keeping it as a single function rather than a class hierarchy makes it easy to unit-test every check in isolation and to add new checks without restructuring the consumer.

from dataclasses import dataclass
from typing import Optional

MIN_PROBE_VERSION = (2, 5, 0)

@dataclass
class FilterResult:
    passed: bool
    drop_reason: Optional[str] = None  # default lets FilterResult(passed=True) work as documented

def quality_filter(m: dict) -> FilterResult:
    """
    Returns FilterResult(passed=True) if the measurement is fit for
    ML training. Otherwise returns a FilterResult with a drop_reason.
    """
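    # 1. Probe version check. A version string that does not parse
    #    (e.g. a non-numeric suffix) is treated as too old instead of
    #    raising inside the consumer.
    try:
        version = tuple(int(x) for x in str(m.get("probe_version", "0.0.0")).split(".")[:3])
    except ValueError:
        version = (0, 0, 0)
    if version < MIN_PROBE_VERSION:
        return FilterResult(False, "old_probe")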

    # 2. Required fields must be present
    required = {"probe_cc", "probe_asn", "test_name", "measurement_start_time",
                "test_keys", "report_id"}
    missing = required - m.keys()
    if missing:
        return FilterResult(False, "missing_fields")

    # 3. Control comparison must have succeeded: any non-empty
    #    control_failure value means there is no usable baseline
    test_keys = m.get("test_keys") or {}
    control_failure = test_keys.get("control_failure")
    if control_failure not in (None, ""):
        return FilterResult(False, "control_failure")

    # 4. At least one protocol layer must have produced a result
    has_dns  = test_keys.get("dns_experiment_failure") is not None or \
               test_keys.get("queries") not in (None, [])
    has_tcp  = test_keys.get("tcp_connect") not in (None, [])
    has_http = test_keys.get("requests") not in (None, [])
    if not (has_dns or has_tcp or has_http):
        return FilterResult(False, "missing_fields")

    # 5. Deduplication: report_id must be globally unique in the batch
    # (handled upstream by the dedup step; measurements that arrive here
    # have already passed the Redis 24h SHA-256 TTL check)
    return FilterResult(True, None)

Five checks, in order of cheapness: the version check is a short string parse followed by a single tuple comparison; the required-field set subtraction is O(k) where k is fixed at six; the control check is a single dict lookup; the protocol-layer presence check is four dict lookups combined with an OR; and the dedup check is handled upstream before the function is called, so it costs nothing here. The total cost is dominated by the Python function-call overhead, not the checks themselves.
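Because the filter is a plain function, each check can be exercised by changing one field of a known-good fixture. The sketch below is illustrative and is not Voidly's actual test suite: the fixture values and the names `base_measurement`, `CASES`, and `run_cases` are hypothetical, and `run_cases` takes the filter as an argument so it can be pointed at `quality_filter`.

```python
from typing import Callable, List


def base_measurement() -> dict:
    """A hypothetical known-good fixture; every value here is made up."""
    return {
        "probe_version": "2.5.1",
        "probe_cc": "DE",
        "probe_asn": "AS3320",
        "test_name": "web_connectivity",
        "measurement_start_time": "2024-01-01 00:00:00",
        "report_id": "fixture-report-0001",
        "test_keys": {
            "control_failure": None,
            "queries": [{"hostname": "example.org"}],
        },
    }


# (case name, mutation applied to the fixture, expected drop_reason or None)
CASES = [
    ("baseline", lambda m: None, None),
    ("old_probe", lambda m: m.update(probe_version="2.4.9"), "old_probe"),
    ("missing_report_id", lambda m: m.pop("report_id"), "missing_fields"),
    ("control_failure",
     lambda m: m["test_keys"].update(control_failure="connection_refused"),
     "control_failure"),
    ("no_protocol_layer", lambda m: m["test_keys"].pop("queries"), "missing_fields"),
]


def run_cases(filter_fn: Callable[[dict], object]) -> List[str]:
    """Return the names of the cases whose outcome differs from the expectation."""
    failures = []
    for name, mutate, expected in CASES:
        m = base_measurement()
        mutate(m)  # each case starts from a fresh fixture
        result = filter_fn(m)
        if result.passed != (expected is None) or result.drop_reason != expected:
            failures.append(name)
    return failures
```

Calling `run_cases(quality_filter)` should return an empty list if the filter behaves as described; any name in the returned list points at the check that regressed.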

Drop reason breakdown

Across the 200M+ measurements in the OONI historical corpus, the filter drops 3.2% of all records. The breakdown by reason:

Drop reason	Rate	Primary cause
control_failure	1.9%	ISP-level block of control server IPs, especially in CN/IR/RU
missing_fields	0.8%	Partial measurements from probes that lost connection mid-test
old_probe	0.3%	Probes running pre-2.5.0 with different `test_keys` field names
duplicate	0.2%	Rare: probe re-upload of a measurement that already cleared the Redis dedup TTL
Total dropped	3.2%

The dominant drop reason is control_failure at nearly three-fifths of all drops. The remaining two-fifths split between connectivity-induced partial measurements, old-probe schema incompatibilities, and the small tail of duplicates that slip through the Redis dedup window.
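The fractions quoted above follow directly from the table: 1.9 / 3.2 is about 0.59, which is where "nearly three-fifths" comes from. A small sketch of that arithmetic, using only the rates in the table (the variable names are illustrative):

```python
# Drop rates from the table, each as a percentage of all records.
drop_rates = {
    "control_failure": 1.9,
    "missing_fields": 0.8,
    "old_probe": 0.3,
    "duplicate": 0.2,
}

# Overall drop rate: the per-reason rates add up to the 3.2% total.
total = sum(drop_rates.values())

# Share of each reason among dropped records only.
shares = {reason: rate / total for reason, rate in drop_rates.items()}

for reason, share in shares.items():
    print(f"{reason:16s} {share:.1%} of drops")
```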

Why control_failure is the dominant drop reason

Control server connections are attempted from the probe's network to Voidly's three neutral control nodes. The control measurement is the baseline: it tells you what a clean connection to the target domain looks like from a network that is not censored. Without it, you cannot distinguish between “the target is blocked on this network” and “the target is down for everyone.”

In heavily censored networks — China, Iran, Russia — ISPs sometimes block control server IPs alongside the target domains. This is not targeted disruption of Voidly specifically; it is the side effect of broad IP-range blocking where the control node's datacenter IP range falls inside a blocked CIDR. The result is a measurement where test_keys.control_failure is set to a non-null value, making it impossible to determine whether the target measurement reflects censorship or network error.

These measurements are quarantined, not discarded. They go to a Kafka topic named measurements_quarantine with a 30-day retention policy. The quarantine exists because control failures in CN/IR/RU are themselves a signal worth preserving: a pattern of consistent control-server blockage from a specific ASN during a specific window is evidence of a network event that may be worth manual investigation. When a human analyst confirms a quarantined measurement as a genuine block via out-of-band evidence, it can be promoted back into the training pipeline with a manually assigned ground-truth label.

Probe version gate

Schema stability was achieved in probe version 2.5.0. Three breaking changes landed in that version that are material to the feature extractor:

Consistent test_keys.tcp_connect[].ip field. Prior to 2.5.0, the field was inconsistently formatted when the resolved address was IPv6 — some probe versions included square brackets (e.g. [2001:db8::1]), others did not. The feature extractor uses this field for ASN lookups and control comparison; the inconsistency produced silent lookup failures on IPv6 addresses in older records.
Normalized queries[].answers[].ipv4 vs. ipv6 fields. Before 2.5.0, both address families were returned in a single answer field with type discrimination left to the consumer. 2.5.0 split them into separate typed fields, making address-family-aware extraction straightforward without per-record type checks.
Added tls_handshake_last_operation. This field records the last TLS operation before a handshake failure, distinguishing a ClientHello-triggered reset from a certificate-verification failure from a timeout during the ServerHello. It is one of the features the TLS interference class relies on most heavily, and it simply does not exist in pre-2.5.0 records.
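The first item is easy to reproduce with Python's standard `ipaddress` module, which rejects the bracketed form outright. The sketch below is only an illustration of the failure mode and of the kind of normalization older records would need; `parses_unmodified` and `normalize_ip` are hypothetical helpers, not part of Voidly's extractor.

```python
import ipaddress
from typing import Optional


def parses_unmodified(raw: str) -> bool:
    """True if the string parses as an IP address exactly as written."""
    try:
        ipaddress.ip_address(raw)
        return True
    except ValueError:
        return False


def normalize_ip(raw: str) -> Optional[str]:
    """Strip optional square brackets and return the canonical address,
    or None if the value still does not parse."""
    candidate = raw.strip()
    if candidate.startswith("[") and candidate.endswith("]"):
        candidate = candidate[1:-1]
    try:
        return str(ipaddress.ip_address(candidate))
    except ValueError:
        return None


# The bracketed form fails a naive parse, which is how a lookup keyed on
# the parsed address can miss silently on older records.
print(parses_unmodified("[2001:db8::1]"))
print(normalize_ip("[2001:db8::1]"))
```

A helper like this would have to run on every pre-2.5.0 record, and each of the other two changes would need its own equivalent.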

Allowing older probes through would require per-version schema normalization inside the feature extractor — conditional logic that branches on probe version for each affected field. That adds failure modes (missing branches for edge-case versions, silent mismatches when a version boundary is off by one) and complicates testing. The version gate keeps the normalization code simple by establishing a hard floor below which records are not processed.

The 0.3% drop rate from old_probe is shrinking. As probes auto-update and the tail of pre-2.5.0 deployments ages out, this drop reason is on a trajectory toward zero. The check remains in place because the version gate is cheap and the cost of letting a version-schema mismatch through is not.

Schema transformation after filtering

Measurements that pass the quality filter move immediately to the 47-feature pipeline. The handoff is a schema transformation function that extracts the fields the feature extractor needs and returns a normalized dict keyed to the ControlDelta schema:

def to_feature_input(m: dict) -> dict:
    """
    Extract the fields the feature extraction pipeline needs.
    Returns a normalized dict keyed to the ControlDelta schema.
    """
    test_keys = m["test_keys"]
    # A missing or null control block is treated as empty so the
    # nested lookups below cannot raise.
    control = test_keys.get("control") or {}
    return {
        "probe_cc":      m["probe_cc"],
        "probe_asn":     m["probe_asn"],
        "test_start":    m["measurement_start_time"],
        "dns_failure":   test_keys.get("dns_experiment_failure"),
        "dns_answers":   test_keys.get("queries", []),
        "tcp_connect":   test_keys.get("tcp_connect", []),
        "tls_handshake": test_keys.get("tls_handshakes", []),
        "http_requests": test_keys.get("requests", []),
        "control_dns":   control.get("dns", {}),
        "control_tcp":   control.get("tcp_connect", {}),
        "control_http":  control.get("http_request", {}),
    }

This function does minimal work: it extracts the nested structures the feature extractor will need and provides empty defaults where fields are absent. It does not compute features — that is entirely the feature extractor's job. Keeping the boundary clean means the quality filter and feature extractor can be tested independently: the filter is tested with raw measurement fixtures, the feature extractor is tested with ControlDelta schema fixtures, and neither test needs to know about the other's internals.

The ControlDelta schema is versioned. Each feature extractor release is paired with a schema version string, and the schema version is written into every training example so the training pipeline can confirm that the feature vectors in a batch were all extracted with the same schema. Schema version mismatches in a training batch are treated as a fatal error — mixing feature vectors from different schema versions produces silently incorrect training data that is harder to detect than a loud failure.
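A batch-level guard for this could look like the following. It is a minimal sketch under stated assumptions: the `schema_version` key and the `SchemaVersionMismatch` exception are hypothetical names, since the article does not say how the version string is stored on each training example.

```python
from typing import Iterable


class SchemaVersionMismatch(RuntimeError):
    """Raised when a training batch mixes ControlDelta schema versions."""


def assert_single_schema_version(batch: Iterable[dict]) -> str:
    """Return the one schema version shared by every example in the batch.

    Raises SchemaVersionMismatch if the batch is empty, if any example
    lacks a version, or if more than one version is present.
    """
    versions = {example.get("schema_version") for example in batch}
    if len(versions) != 1 or None in versions:
        found = sorted(str(v) for v in versions)
        raise SchemaVersionMismatch(
            f"expected exactly one schema version, found {found}"
        )
    return versions.pop()
```

Raising instead of filtering out the minority version matches the stated preference for a loud failure over silently incorrect training data.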

Throughput and pipeline position

The quality filter runs in the Kafka consumer, inline before the feature extraction step. This placement is deliberate: running the filter inside the consumer rather than as a separate service means there is no network hop between the filter decision and the feature-extraction call. A measurement that passes the filter is handed directly to the extractor in the same process.

At the 45K rows/sec ingest rate (the throughput figure from the probe ingest pipeline), the filter adds roughly 0.3ms of latency per measurement. The latency is dominated by Python interpreter overhead for the function call and the dataclass instantiation, not by the boolean logic. The checks themselves are negligible at this scale; the overhead is the fixed cost of dynamic dispatch in Python. A Rust implementation of the same checks would eliminate most of this. For scale, 0.3ms per measurement is 300 seconds of cumulative filter time per million measurements, or 13.5 seconds of filter work for each second of ingest at 45K rows/sec, which implies the work is spread across parallel consumers rather than run on a single core. At current volume that is not a bottleneck worth optimizing.

Rejected measurements, other than the true duplicates described in the deduplication section, go to the measurements_quarantine Kafka topic with a 30-day retention policy. The quarantine topic is consumed by a separate audit process that produces daily drop-rate reports broken down by reason, country, and ASN. These reports feed two monitoring alerts: a spike in control_failure drops from a specific country is forwarded to the network event detection pipeline as a potential censorship signal; a spike in missing_fields drops from a specific probe version is forwarded to the probe engineering team as a potential regression.
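The routing described in this section can be sketched without any Kafka client. In the sketch below, `route` and its three callables are hypothetical stand-ins: `filter_fn` plays the role of `quality_filter`, `extract` stands for the in-process feature extractor, and `quarantine` stands for a producer writing to the measurements_quarantine topic. Duplicates are discarded here instead of quarantined, following the deduplication section.

```python
from typing import Callable


def route(
    measurement: dict,
    filter_fn: Callable[[dict], object],
    extract: Callable[[dict], None],
    quarantine: Callable[[dict, str], None],
) -> str:
    """Apply the filter and hand the measurement to exactly one destination.

    Returns "extracted", "quarantined", or "discarded" so the caller can
    count outcomes.
    """
    result = filter_fn(measurement)
    if result.passed:
        # Same process, no network hop between filter and extractor.
        extract(measurement)
        return "extracted"
    if result.drop_reason == "duplicate":
        # True duplicates carry no extra information and are not kept.
        return "discarded"
    # Everything else is preserved with its drop reason for the audit job.
    quarantine(measurement, result.drop_reason)
    return "quarantined"
```

Passing the destinations in as callables keeps the routing logic testable with plain lists in place of the extractor and the quarantine producer.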

The filter is also the point where the pipeline emits the measurement_quality Prometheus counter, which is used in the SLA dashboard to track the fraction of inbound measurements that are usable for training over rolling 1h, 24h, and 7d windows. A sustained usable-fraction below 95% (i.e., drop rate above 5%) triggers a pager alert: it means either the probe population has regressed, control servers are under broad ISP blockage, or a schema change has introduced a field-name mismatch that the filter is catching as missing_fields.
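The usable-fraction threshold can be illustrated with a plain rolling window. This is a sketch of the arithmetic only, not the Prometheus counter or the alerting rule itself: `UsableFractionWindow` is a hypothetical class, timestamps are passed in explicitly, and the "sustained" condition that would gate a real page is not modeled.

```python
from collections import deque
from typing import Deque, Optional, Tuple


class UsableFractionWindow:
    """Fraction of measurements that passed the filter within a time window."""

    def __init__(self, window_seconds: float, threshold: float = 0.95) -> None:
        self.window_seconds = window_seconds
        self.threshold = threshold
        self._events: Deque[Tuple[float, bool]] = deque()  # (timestamp, passed)
        self._passed = 0  # count of passed events currently in the window

    def record(self, timestamp: float, passed: bool) -> None:
        """Add one filter outcome and evict events that fell out of the window."""
        self._events.append((timestamp, passed))
        self._passed += int(passed)
        cutoff = timestamp - self.window_seconds
        while self._events and self._events[0][0] <= cutoff:
            _, was_passed = self._events.popleft()
            self._passed -= int(was_passed)

    def usable_fraction(self) -> Optional[float]:
        """Passed / total over the window, or None if the window is empty."""
        if not self._events:
            return None
        return self._passed / len(self._events)

    def below_threshold(self) -> bool:
        """True when the usable fraction is under the threshold (default 95%)."""
        fraction = self.usable_fraction()
        return fraction is not None and fraction < self.threshold
```

One instance per window length (3600, 86400, and 604800 seconds) would correspond to the 1h, 24h, and 7d views mentioned above.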

Deduplication and the Redis TTL window

The dedup step runs upstream of the quality filter, in the Cloudflare Worker that receives the raw probe upload. Each measurement is hashed (SHA-256 of the report_id) and checked against a Redis key-value store with a 24-hour TTL. If the hash is already present, the measurement is rejected at the ingress layer and never enters the Kafka pipeline at all.

The 0.2% duplicate drop rate at the quality filter represents measurements that slipped through the Redis window: a probe re-uploaded a measurement after the 24-hour TTL expired, so the Redis check came back clean. These duplicates are caught at the quality-filter stage using a longer-lived dedup store, a PostgreSQL table of ingested report_id values that does not expire. (This lookup is separate from the quality_filter function shown earlier, which assumes dedup has already run.) Measurements caught here are dropped without going to the quarantine topic, since they are true duplicates rather than potentially valuable corrupted measurements.

The reason for the two-tier dedup is operational: the Redis 24h TTL keeps the hot-path check fast (sub-millisecond) and the Redis memory footprint bounded. The PostgreSQL table handles the long-tail duplicates that appear days or weeks after the original upload, at a cost of an additional DB round-trip that only applies to the small fraction of measurements that cleared Redis. The combined system catches >99.9% of duplicates. The 0.2% of records dropped as duplicates at the quality filter is the long tail that only the longer dedup window can catch.
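The two tiers can be sketched with in-memory stand-ins: a dict of expiry times in place of Redis and a set in place of the PostgreSQL table. Everything here is an assumption made for illustration, including the `TwoTierDedup` class, its method names, and the decision to collapse both tiers into one object; in the pipeline described above the Redis check runs in the Cloudflare Worker and the durable lookup runs later.

```python
import hashlib
from typing import Dict, Set

HOT_TTL_SECONDS = 24 * 60 * 60  # the 24-hour TTL of the hot tier


class TwoTierDedup:
    """Hot tier with a TTL in front of a durable tier that never expires."""

    def __init__(self) -> None:
        self._hot: Dict[str, float] = {}  # SHA-256 hex digest -> expiry time
        self._durable: Set[str] = set()   # every report_id ever ingested

    @staticmethod
    def key(report_id: str) -> str:
        """SHA-256 of the report_id, as used for the hot-tier key."""
        return hashlib.sha256(report_id.encode("utf-8")).hexdigest()

    def check(self, report_id: str, now: float) -> str:
        """Classify an upload as "new", "duplicate_hot", or "duplicate_durable"."""
        k = self.key(report_id)
        expiry = self._hot.get(k)
        if expiry is not None and expiry > now:
            return "duplicate_hot"  # rejected on the fast path
        # Hot tier missed: never seen, or the TTL has expired. Re-arm the
        # hot key (an assumption), then consult the durable store.
        self._hot[k] = now + HOT_TTL_SECONDS
        if report_id in self._durable:
            return "duplicate_durable"  # the long-tail re-upload case
        self._durable.add(report_id)
        return "new"
```

A re-upload one hour after the original returns "duplicate_hot"; the same re-upload 25 hours later misses the hot tier and is only caught by the durable set.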

What the filter does not check

The quality filter is deliberately narrow. It checks for structural completeness and version compatibility; it does not attempt to validate the semantic content of a measurement. Concretely, it does not:

Validate IP address formats. The feature extractor handles malformed IP strings by catching parse exceptions and treating them as null values. Surfacing these as a drop reason would require parsing every IP in every measurement — expensive and redundant with the extractor's own error handling.
Detect measurement timestamp outliers. A measurement with a start time in 2019 is unusual but not structurally invalid — it might represent a delayed upload from a probe that was offline. The training pipeline's temporal split handles old measurements by placing them in the training set, not by rejecting them.
Validate country code or ASN plausibility. An implausible probe_cc (e.g., ZZ) is caught during feature extraction when the country embedding lookup returns a null embedding. That produces a null feature vector row, which is dropped before the training step. Adding a CC allowlist to the quality filter would require maintaining the list and create a coupling between the filter and the embedding vocabulary.

Each of these non-checks reflects a deliberate scope decision: the quality filter catches the categories of problems that corrupt the feature extractor's input in ways it cannot recover from. Problems that the extractor can handle with graceful degradation (null imputation, exception catching, downstream row drops) are not the quality filter's responsibility.

For the OONI historical corpus that provides the raw measurements this filter operates on: Building the OONI historical corpus: 1.66M downloads, schema normalization, and the decisions behind the dataset →

For the ML training pipeline that consumes quality-filtered measurements: Voidly's ML training pipeline: building a labeled censorship dataset from OONI measurements →

For the full probe-to-dataset ingest pipeline that runs the quality filter in the Kafka consumer: Voidly's probe-to-dataset ingest pipeline: normalization, quality filtering, and TimescaleDB indexing →

For the 47-feature pipeline that consumes the output of this quality filter: The 47 features that classify internet censorship: how Voidly extracts signal from raw network measurements →

OONI data normalization covers the upstream schema version detection and bitmask normalization step that produces the measurements this quality filter subsequently gates.