The Voidly measurement scheduler: how we decide which domains to probe and when

December 20, 2024 · 9 min read · AI Analytics

Censorship · Voidly · Methodology · Infrastructure

Probing 80 domains across 4 protocols from 37+ vantage points every 5 minutes sounds mechanical. In practice it is a resource-allocation problem with at least three competing constraints: measurement coverage (every domain in every country, on every relevant protocol), probe safety (a probe that hammers all 80 sites in 30 seconds looks like a scanner to a deep-packet inspection system), and freshness (when censorship is actively changing, the most important domains need sub-5-minute resolution, not a 20-minute slot in a round-robin queue).

The measurement scheduler is the system that reconciles those constraints. It runs on the control server, not on any individual probe. At the start of each 5-minute window it generates a MeasurementPlan for every active probe — a prioritized, jittered task list telling the probe exactly which domains to measure, via which protocols, and in what order. This post documents how that plan is constructed.

The measurement budget problem

The naive framing of our measurement surface is: 80 domains × 4 protocols (DNS, TCP/443, HTTP, HTTPS) × 200 countries = 64,000 measurement slots per 5-minute window. That number is manageable in aggregate, but it obscures a structural asymmetry in how our probe network is distributed.

Our 37+ probes do not cover 200 countries equally. High-risk countries — those with active censorship regimes and the highest demand for measurement data — have multiple probes on multiple ASNs. A country like Germany or Japan may have one or two probes primarily for baseline comparison. Countries we have no probe presence in at all are covered only by OONI and CensoredPlanet in our cross-source reconciliation layer; Voidly contributes no direct measurement data there.

For a country with 4 active probes, a naive full-coverage schedule would still have every probe run the full list: 80 domains × 4 protocols = 320 tasks per probe per 5-minute window. At 3 concurrent measurement threads per probe (the maximum we set to avoid triggering rate limits), a 320-task queue takes approximately 4–5 minutes to complete, which leaves essentially no slack in the window. Any probe slowdown (a high-latency TLS handshake, a stalled HTTP response waiting for a timeout) pushes measurements into the next window and creates a gap in coverage exactly when we need it most.
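To make the timing claim concrete, here is a back-of-the-envelope estimate of how long a probe needs to drain its queue. The function name and the 2.5-second average task duration are illustrative assumptions, not measured values; the task counts and the concurrency limit come from the text above.

```python
import math


def estimate_drain_seconds(task_count: int, max_concurrent: int, avg_task_s: float) -> float:
    """Rough time to finish a queue when max_concurrent tasks run in parallel."""
    rounds = math.ceil(task_count / max_concurrent)  # batches of parallel tasks
    return rounds * avg_task_s


# Assumed average of 2.5 s per task (hypothetical figure for illustration only).
naive = estimate_drain_seconds(320, 3, 2.5)
reduced = estimate_drain_seconds(50, 3, 2.5)
print(f"naive: {naive} s, reduced: {reduced} s of a 300 s window")
```

Under that assumption the naive queue consumes most of the window, while a plan of about 50 tasks leaves ample slack for slow handshakes and timeouts.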

The scheduler's job is to shrink that naive budget (80 domains, each on all 4 protocols, or 320 protocol-level measurements) to something closer to 50 domain tasks per window while preserving coverage for the domains and protocols that actually reveal censorship. In steady state, across the global probe network, we average 49 tasks per 5-minute window per probe, a reduction of just under 40% from the 80 domain tasks of naive full coverage.

The MeasurementPlan data model

The scheduler produces two dataclasses. MeasurementTask represents a single domain + protocol combination that a probe should execute. MeasurementPlan is the complete set of tasks the probe should run in a given 5-minute window, along with constraints on concurrency and timing:

from dataclasses import dataclass
from datetime import datetime

@dataclass
class MeasurementTask:
    domain: str
    protocols: list[str]     # ['dns', 'tcp', 'http', 'https'] or subset
    priority: int            # 0 (low) to 10 (high)
    jitter_ms: int           # start offset within the window, including random jitter
    expected_duration_ms: int

@dataclass
class MeasurementPlan:
    probe_id: str
    window_start: datetime
    window_duration_s: int   # 300 (5 minutes)
    tasks: list[MeasurementTask]
    max_concurrent: int      # typically 3 (avoids rate-limit triggers)

The probe receives its plan at the start of each window via the WireGuard tunnel. It executes tasks in priority order, respecting the jitter_ms offset on each task and never exceeding max_concurrent parallel requests. Tasks that do not complete within the window are abandoned — we do not carry incomplete tasks across window boundaries. Instead, the scheduler compensates in the next window via the recency boost described below.
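The execution rules can be sketched as a small offline simulation. This is not the probe's actual executor: the `SimTask` class, the `simulate_window` function, and the greedy worker model are assumptions made for illustration, using only the fields described above.

```python
import heapq
from dataclasses import dataclass


@dataclass
class SimTask:
    domain: str
    priority: int
    jitter_ms: int             # earliest start, as an offset from window start
    expected_duration_ms: int


def simulate_window(tasks, window_ms=300_000, max_concurrent=3):
    """Return (completed, abandoned) domains for one window."""
    workers = [0] * max_concurrent      # time at which each worker is free
    heapq.heapify(workers)
    completed, abandoned = [], []
    # Highest priority first; sorted() is stable, so ties keep plan order.
    for task in sorted(tasks, key=lambda t: -t.priority):
        free_at = heapq.heappop(workers)
        start = max(free_at, task.jitter_ms)   # respect the jitter offset
        end = start + task.expected_duration_ms
        if end <= window_ms:
            completed.append(task.domain)
            heapq.heappush(workers, end)
        else:
            abandoned.append(task.domain)      # not carried into the next window
            heapq.heappush(workers, free_at)
    return completed, abandoned


tasks = [
    SimTask("a.example", 10, 0, 1_000),
    SimTask("b.example", 5, 299_500, 1_000),   # starts too late to finish
    SimTask("c.example", 7, 2_000, 500),
]
done, dropped = simulate_window(tasks)
print(done, dropped)
```

A simulation like this is mainly useful for checking whether a candidate plan fits inside the window before it is sent to a probe.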

The plan is generated server-side rather than on the probe for two reasons. First, the priority computation requires database queries — recent anomaly rates, hours since last measurement, historical block patterns — that the probe does not have access to. Second, centralized scheduling lets us coordinate across probes in the same country: we can ensure two probes on different ASNs measure the same domain in the same window without duplicating that logic on each device.

Domain priority tiers and how they are computed

Not all 80 domains are equally urgent. The scheduler assigns each domain a priority score from 0 to 10 for each country, computed from three inputs: a category base score, a recent anomaly boost, and a recency boost for domains that have not been measured recently.

The base score comes from the domain's OONI category code. We use a simplified version of OONI's category taxonomy that collapses the full list down to the categories that appear in our test list:

CATEGORY_PRIORITY = {
    "NEWS":  8,   # news_media — high priority
    "SMG":   8,   # social media — high priority
    "HUMR":  7,   # human rights
    "POLR":  6,   # political criticism
    "COMM":  5,   # communication tools (VoIP, messaging)
    "ANON":  6,   # anonymization tools
    "REL":   4,   # religion
    "XPO":   3,   # pornography
    "GAME":  2,   # gaming — low priority
    # etc.
}

News and social media sit at 8 because they are the most commonly blocked categories globally and the most time-sensitive for journalists. Anonymization tools (VPN landing pages, Tor Project) are at 6 because their blocking frequently coincides with broader access restrictions: a country that starts blocking VPN providers often escalates to broader blocks within days. Gaming sites are at 2; we include them for false-positive calibration (a country that appears to block a gaming site is probably experiencing a routing issue, not targeted censorship), but they do not need high-frequency measurement.

On top of the base score, the scheduler applies two dynamic boosts:

def compute_domain_priority(domain: str, country_cc: str) -> int:
    base = CATEGORY_PRIORITY.get(domain_category(domain), 5)

    # Boost for recent anomalies
    recent_anomaly_score = db.query_recent_anomaly_rate(
        domain=domain,
        country_cc=country_cc,
        window_hours=6
    )
    anomaly_boost = min(3, int(recent_anomaly_score * 10))

    # Boost for time since last measurement
    hours_since_last = db.hours_since_last_measurement(domain, country_cc)
    recency_boost = min(2, int(hours_since_last / 2))

    return min(10, base + anomaly_boost + recency_boost)

The anomaly boost can add up to 3 points. A domain with a 30% anomaly rate in the past 6 hours (anomaly score = 0.30) gets an anomaly boost of 3, which takes a NEWS domain from 8 to the cap of 10 (8 + 3 = 11, clamped to 10): maximum priority for every window. The recency boost adds 1 point once a domain has gone 2 hours without a measurement and the full 2 points after 4 hours, which prevents low-priority domains from going unmeasured indefinitely when the schedule is compressed.
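A few worked cases make the formula easier to check by hand. The sketch below repeats the same arithmetic but takes the inputs as plain arguments instead of querying a database; `priority_from_inputs` and the example inputs are illustrative, not part of the scheduler.

```python
def priority_from_inputs(base: int, anomaly_rate: float, hours_since_last: float) -> int:
    """Same arithmetic as compute_domain_priority, with inputs passed directly."""
    anomaly_boost = min(3, int(anomaly_rate * 10))      # up to +3
    recency_boost = min(2, int(hours_since_last / 2))   # up to +2
    return min(10, base + anomaly_boost + recency_boost)


examples = {
    "NEWS, 30% anomalies, just measured": priority_from_inputs(8, 0.30, 0.0),
    "GAME, no anomalies, 5 h stale": priority_from_inputs(2, 0.0, 5.0),
    "REL, 10% anomalies, 3 h stale": priority_from_inputs(4, 0.10, 3.0),
}
for label, score in examples.items():
    print(f"{label}: {score}")
```

The NEWS case hits the cap, the GAME case is lifted from the low tier into the medium tier by staleness alone, and the REL case gains one point from each boost.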

The priority score determines measurement frequency. High-priority domains (score ≥ 7) are measured in every 5-minute window. Medium-priority domains (score 4–6) are measured every other window — effectively every 10 minutes. Low-priority domains (score ≤ 3) are measured every 4 windows, or every 20 minutes. This tiering accounts for the bulk of the per-probe load reduction: a test list where roughly 40% of domains are in the low-priority tier reduces the average per-window task count from 80 to approximately 50 before jitter and protocol selection are applied.
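The tier rule and its effect on load can be sketched in a few lines. The function names and the 36/12/32 split of the 80 domains are assumptions chosen so that 40% of domains fall in the low tier; the real mix varies by country.

```python
def windows_between_measurements(priority: int) -> int:
    """Measurement cadence, in 5-minute windows, for a priority score."""
    if priority >= 7:
        return 1    # high: every window
    if priority >= 4:
        return 2    # medium: every other window
    return 4        # low: every 4 windows


def is_due(priority: int, window_index: int) -> bool:
    """True if a domain with this priority is scheduled in the given window."""
    return window_index % windows_between_measurements(priority) == 0


def expected_tasks_per_window(tier_counts: dict[int, int]) -> float:
    """tier_counts maps a representative priority to a number of domains."""
    return sum(n / windows_between_measurements(p) for p, n in tier_counts.items())


# Hypothetical mix of 80 domains: 36 high, 12 medium, 32 low (40% low).
mix = {8: 36, 5: 12, 2: 32}
print(expected_tasks_per_window(mix))
```

In this simplified form every low-priority domain falls due in the same window; a production scheduler would presumably stagger domains within a tier so the per-window load stays flat.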

Protocol selection per domain

Even after the frequency reduction, there is a second dimension of budget control: not every domain needs all 4 protocols in every window. The scheduler selects the protocol subset based on what has historically been informative for a given domain in a given country:

def select_protocols(domain: str, country_cc: str) -> list[str]:
    # site_serves_http, has_http_block_history, and has_tcp_block_history
    # are lookups against measurement history, defined elsewhere.
    protocols = ['dns', 'https']  # always measure at minimum

    # Add HTTP if the site serves HTTP or has HTTP-specific blocking history
    if site_serves_http(domain) or has_http_block_history(domain, country_cc):
        protocols.append('http')

    # Add TCP if there is history of TCP RST injection or null-routing
    if has_tcp_block_history(domain, country_cc):
        protocols.append('tcp')

    # Entries are already unique; returning the list keeps a stable order
    return protocols

DNS and HTTPS are always included. DNS because it is the most common interference vector globally — DNS tampering accounts for the majority of censorship events in our dataset. HTTPS because TLS handshake interference (connection reset, invalid certificate injection) is the second most common. HTTP is added only when the domain has a known HTTP endpoint or when we have previously seen HTTP-specific blocking in the country — a block page served on HTTP/80 while HTTPS/443 returns a timeout, for instance. TCP is added specifically when we have seen TCP RST injection or null-routing in the country's history for this domain.

In practice, most domains in low-censorship countries are measured on 2 protocols (DNS + HTTPS). Domains in high-censorship countries with active block histories are measured on all 4. The weighted average across the global probe network is 2.7 protocols per domain per window — a meaningful reduction from the naive assumption of 4.
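Combining the two reductions gives a rough protocol-level budget per probe. The calculation below multiplies the two published averages, which assumes the protocol count is independent of which domains happen to be scheduled; treat the result as an order-of-magnitude estimate rather than a measured figure.

```python
NAIVE_DOMAINS = 80
NAIVE_PROTOCOLS = 4
AVG_TASKS_PER_WINDOW = 49      # global average from the post
AVG_PROTOCOLS_PER_TASK = 2.7   # weighted average from the post

naive_measurements = NAIVE_DOMAINS * NAIVE_PROTOCOLS
actual_measurements = AVG_TASKS_PER_WINDOW * AVG_PROTOCOLS_PER_TASK
reduction = 1 - actual_measurements / naive_measurements

print(f"{actual_measurements:.1f} of {naive_measurements} protocol-level "
      f"measurements per window ({reduction:.0%} fewer)")
```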

Jitter and anti-detection

A probe that queries all 50 of its assigned domains in a rapid burst, one after another within a few seconds, is behaviorally distinguishable from normal user traffic. ISPs operating deep-packet inspection infrastructure in censorship-heavy jurisdictions have deployed systems that flag bursty multi-domain request patterns originating from the same IP address. We have seen evidence of this in anomaly rates for probes that have been running for extended periods in CN, RU, and IR: probes that run unjittered cycles accumulate steadily worsening false-negative rates, as the ISP appears to begin routing their traffic through a lighter-touch inspection path (or silently disrupts their measurements in ways that produce apparent success rather than detectable failure).

The scheduler defends against this by spreading each probe's task list across the 5-minute window, with per-task randomization:

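import random

def apply_jitter(tasks: list[MeasurementTask], window_s: int) -> list[MeasurementTask]:
    if not tasks:
        return tasks  # nothing to schedule; also avoids division by zero
    # Spread tasks across the window, with ±15% randomness
    base_interval_ms = (window_s * 1000) // len(tasks)
    for i, task in enumerate(tasks):
        offset_ms = i * base_interval_ms + random.uniform(
            -0.15 * base_interval_ms, 0.15 * base_interval_ms
        )
        # Clamp at zero so the first task never gets a negative offset
        task.jitter_ms = max(0, int(offset_ms))
    return tasks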

For a 50-task plan in a 300-second window, the base interval is 6,000 ms per task. With ±15% jitter, each task starts within 0.9 seconds of its nominal slot, so the gap between consecutive tasks varies from roughly 4.2 to 7.8 seconds. The resulting traffic pattern (spaced HTTPS requests mixed with occasional DNS queries and HTTP requests) is consistent with normal browser behavior for a user visiting multiple sites sequentially.
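The spread can be checked numerically. The sketch below mirrors the jitter formula but works on bare offsets, takes a seeded `random.Random` so the run is repeatable, and clamps the first offset at zero; `jittered_offsets` and the seed are assumptions for illustration.

```python
import random


def jittered_offsets(n_tasks: int, window_s: int, rng: random.Random) -> list[int]:
    """Start offsets in ms: one slot per task, each shifted by up to ±15% of a slot."""
    base = (window_s * 1000) // n_tasks
    return [
        max(0, int(i * base + rng.uniform(-0.15 * base, 0.15 * base)))
        for i in range(n_tasks)
    ]


rng = random.Random(42)                 # fixed seed, illustration only
offsets = jittered_offsets(50, 300, rng)
base = (300 * 1000) // 50               # 6000 ms

# How far each task lands from its nominal slot, and the spacing between tasks.
max_slot_error = max(abs(o - i * base) for i, o in enumerate(offsets))
gaps = [b - a for a, b in zip(offsets, offsets[1:])]
print(max_slot_error, min(gaps), max(gaps))
```

Because each offset is drawn independently around its own slot, the spacing between neighbouring tasks can differ from the base interval by up to twice the per-task jitter.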

For probes in high-risk countries (CN, RU, IR, BY, VN), the scheduler applies two additional anti-detection measures beyond basic jitter. First, the domain order within the plan is fully randomized on each window — the same domain does not appear at the same position in the sequence across successive windows. Second, a random 10–15% of tasks are deliberately skipped and deferred to the next window. This means no two consecutive 5-minute cycles from the same probe look identical in timing or coverage, which defeats statistical pattern-matching against an expected probe signature. The skipped tasks are automatically compensated by the recency boost in the priority calculation, so they are scheduled with elevated priority in the following window.
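The two extra measures for high-risk countries can be sketched as a single pass over the plan. The `harden_plan` name, the seeded generator, and the choice to truncate the skip count are assumptions; the 10–15% range comes from the text above.

```python
import random


def harden_plan(domains: list[str], rng: random.Random) -> tuple[list[str], list[str]]:
    """Shuffle the order and defer a random 10-15% of tasks to a later window."""
    shuffled = domains[:]               # leave the caller's list untouched
    rng.shuffle(shuffled)               # new order on every window
    skip_fraction = rng.uniform(0.10, 0.15)
    n_skip = int(len(shuffled) * skip_fraction)
    deferred = shuffled[:n_skip]        # picked at random because of the shuffle
    kept = shuffled[n_skip:]
    return kept, deferred


domains = [f"site{i}.example" for i in range(50)]
kept, deferred = harden_plan(domains, random.Random(7))
print(len(kept), len(deferred))
```

For a 50-task plan this defers between 5 and 7 tasks per window, and which tasks are deferred changes from one window to the next.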

ASN distribution across probes in the same country

For countries with multiple active probes, the scheduler has an additional coordination task: ensuring that each high- and medium-priority domain is measured from at least 2 different ASNs in each window. This is the foundation of the cross_asn_consistency field in the dataset, a field that is only meaningful when the underlying measurements actually came from different autonomous systems rather than two probes on the same ISP.

import random

def assign_asn_distribution(
    probes: list[ProbeInfo],
    domains: list[str]
) -> dict[str, list[str]]:
    """Returns {domain: [probe_id, ...]} assignment."""
    # Group probes by ASN (group_by_asn returns {asn: [ProbeInfo, ...]})
    asn_groups = group_by_asn(probes)

    assignment = {}
    for domain in domains:
        # Pick one probe at random from each ASN group
        chosen = [random.choice(asn_probes) for asn_probes in asn_groups.values()]
        assignment[domain] = [p.probe_id for p in chosen]

    return assignment

If a country has probes on 3 ASNs, every domain in the high-priority tier will be measured from all 3 ASNs in each window. Medium-priority domains are assigned to at least 2 ASN groups per window. Low-priority domains may be assigned to only 1 ASN group in a given window, rotating across ASNs over successive windows so that coverage across all vantage points is maintained over a 20-minute horizon.
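The tier rules in this paragraph are not visible in the function above, which assigns every domain to every ASN group. One way the tiering could be layered on top is sketched below; `asn_groups_for`, the `round_index` rotation counter, and the AS numbers (taken from the range reserved for documentation) are all hypothetical.

```python
def asn_groups_for(priority: int, asns: list[str], round_index: int) -> list[str]:
    """Choose which ASN groups measure a domain, based on its priority tier.

    round_index counts how many times the domain has been scheduled so far,
    so the rotation advances even when the domain is not due in every window.
    """
    if not asns:
        return []
    if priority >= 7:
        k = len(asns)               # high: every ASN group
    elif priority >= 4:
        k = min(2, len(asns))       # medium: 2 groups when available
    else:
        k = 1                       # low: one group, rotated each round
    start = round_index % len(asns)
    return [asns[(start + j) % len(asns)] for j in range(k)]


asns = ["AS64496", "AS64497", "AS64498"]
high = asn_groups_for(8, asns, 0)
medium = asn_groups_for(5, asns, 1)
low_rounds = [asn_groups_for(2, asns, r) for r in range(3)]
print(high, medium, low_rounds)
```

A probe would then be drawn at random from each selected group, as in the assignment function above.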

The ASN assignment runs before jitter is applied, so different probes assigned the same domain may execute their measurements at different times within the same window. This is intentional: if two probes hit the same domain within milliseconds of each other, any ISP-level rate limiting could affect both simultaneously, biasing the cross-ASN comparison. Staggered execution within the window produces a more independent measurement pair.

Adaptive scheduling on anomaly detection

The priority system described above operates on a 6-hour lookback window for anomaly history. That latency is acceptable for steady-state scheduling — a domain's priority at the start of a new window reflects everything that happened in the past 6 hours. But when a block starts in real time, a 6-hour lookback introduces unnecessary lag: a domain that just started being blocked 5 minutes ago has zero anomaly history in the lookback window, and will be scheduled at its baseline priority rather than at elevated urgency.

The real-time pipeline closes this gap. When the anomaly classifier crosses the detection confidence threshold for a new event, it sends a HighPrioritySignal directly to the scheduler:

from dataclasses import dataclass
from datetime import datetime

@dataclass
class HighPrioritySignal:
    domain: str
    country_cc: str
    anomaly_type: str  # 'DNS_TAMPERING' | 'TLS_INTERFERENCE' | 'HTTP_BLOCK'
    confidence: float
    detected_at: datetime

def handle_high_priority_signal(signal: HighPrioritySignal) -> None:
    # Immediately inject the domain into the upcoming windows for all probes in
    # this country with an elevated budget: all 4 protocols, every window, for
    # 6 windows. `scheduler` is the scheduler instance, defined elsewhere.
    scheduler.inject_urgent(
        domain=signal.domain,
        country_cc=signal.country_cc,
        protocols=['dns', 'tcp', 'http', 'https'],
        windows=6,  # measure every window for the next 30 minutes
        priority=10
    )

The inject_urgent call bypasses the normal priority computation entirely. The domain is injected into the next 6 consecutive windows (30 minutes of elevated measurement) on all 4 protocols, regardless of the protocol-selection logic described earlier. This is the most expensive operation the scheduler performs: 6 windows × 4 protocols × N probes in the affected country can add a significant number of tasks to each probe's queue. The max_concurrent limit and the window duration act as a natural safety valve: the probe executes as many tasks as it can within the 300-second window and abandons the rest, which the scheduler can pick up again when it builds the next window's plan.

After the 6-window urgent period expires, the domain reverts to the normal priority computation. By that point, the 6-hour anomaly lookback window will contain the measurements collected during the urgent period, so the anomaly boost will kick in and maintain elevated (though not maximum) priority for as long as the block persists.
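The lifetime of an urgent injection can be modelled as a window range per domain and country. The `UrgentRegistry` class below is a hypothetical sketch of that bookkeeping, not the internals of inject_urgent; it tracks windows by integer index for simplicity.

```python
from dataclasses import dataclass, field


@dataclass
class UrgentRegistry:
    """Maps (domain, country_cc) to the (first, last) urgent window index."""
    active: dict[tuple[str, str], tuple[int, int]] = field(default_factory=dict)

    def inject(self, domain: str, country_cc: str,
               current_window: int, windows: int = 6) -> None:
        # Urgent coverage starts with the next window and lasts `windows` windows.
        self.active[(domain, country_cc)] = (current_window + 1,
                                             current_window + windows)

    def is_urgent(self, domain: str, country_cc: str, window_index: int) -> bool:
        span = self.active.get((domain, country_cc))
        return span is not None and span[0] <= window_index <= span[1]

    def prune(self, window_index: int) -> None:
        # Drop entries whose urgent period has fully expired.
        self.active = {k: v for k, v in self.active.items() if v[1] >= window_index}


registry = UrgentRegistry()
registry.inject("news.example", "IR", current_window=10)
print([w for w in range(10, 19) if registry.is_urgent("news.example", "IR", w)])
```

Once is_urgent returns False, the domain falls back to the normal priority computation, where the measurements gathered during the urgent period feed the anomaly boost.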

Window alignment and cross-probe synchronization

All probes — regardless of their geographic location or local system clock — are synchronized to the same 5-minute window boundaries, aligned to the Unix epoch:

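import time
from datetime import datetime, timedelta, timezone

def current_window_start() -> datetime:
    """Returns the start of the current 5-minute measurement window (UTC)."""
    now = int(time.time())
    window_start_ts = now - (now % 300)  # round down so window_start_ts % 300 == 0
    # utcfromtimestamp() is deprecated since Python 3.12; return an aware UTC datetime
    return datetime.fromtimestamp(window_start_ts, tz=timezone.utc)

def next_window_start() -> datetime:
    return current_window_start() + timedelta(seconds=300)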

This alignment is what makes cross-probe comparison meaningful. When we compute cross_asn_consistency for a domain in a given country, we join measurements from different probes on their shared window_start timestamp. Without epoch-aligned windows, “same window” would have no precise definition, and probes whose clocks drifted by even 30 seconds could end up in different windows, breaking the cross-probe join.
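A small example shows both halves of that argument: measurements taken 30 seconds apart share a window when the clocks agree, and a 30-second drift near a boundary splits them. The `window_start` helper and the timestamps are illustrative; it takes the timestamp as an argument and returns a timezone-aware UTC datetime.

```python
from datetime import datetime, timezone

WINDOW_S = 300


def window_start(ts: float) -> datetime:
    """Start of the 5-minute window containing Unix timestamp ts."""
    aligned = int(ts) - (int(ts) % WINDOW_S)
    return datetime.fromtimestamp(aligned, tz=timezone.utc)


boundary = 1_700_000_100      # divisible by 300, so it is a window boundary
probe_a = boundary + 10       # 10 s into the window
probe_b = boundary + 40       # 30 s later, same window
probe_c = probe_a - 30        # clock running 30 s behind probe_a

print(window_start(probe_a) == window_start(probe_b))
print(window_start(probe_a) == window_start(probe_c))
```

The second comparison is the failure mode the clock-drift checks are meant to prevent: the join key differs even though the measurements were taken at nearly the same moment.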

In practice, probe clock drift is managed by requiring NTP synchronization as a precondition for the probe to connect to the control server. The WireGuard handshake is not a substitute for that check: its TAI64N timestamp only has to be newer than the last one received from the same peer, which protects against replayed handshakes but does not bound clock skew. We treat any probe with observed clock drift above 5 seconds as a quality warning; measurements from that probe are flagged in the dataset until clock sync is restored.

Scheduler metrics in steady state

The differences in per-window task counts across countries reflect the interplay between the priority system and anomaly history. Countries with active censorship regimes accumulate more anomaly history, which elevates more domains into the high-priority tier, which increases the number of tasks scheduled per window. The adaptive injection mechanism adds further load during active incidents:

CN — 68 tasks per window (persistent high anomaly rate across multiple domain categories)
RU — 72 tasks per window (elevated since 2022; many NEWS and SMG domains in the high-priority tier)
IR — 74 tasks per window (highest task count; active blocks across all 4 protocols for many domains)
TR — 58 tasks per window (intermittent blocking; moderate anomaly history)
DE — 43 tasks per window (low anomaly rate; most domains in the medium or low-priority tier)
Global average — 49 tasks per 5-minute window per probe

The RU figure of 72 is worth highlighting. It exceeds CN (68) despite China being the canonical high-censorship environment in most discussions of internet freedom. The reason is measurement profile, not censorship severity: China's blocking is largely stable and well-characterized, so domains that are consistently blocked accumulate a moderate anomaly boost rather than a maximum one (a domain blocked 100% of the time has no signal variance and does not trigger the adaptive injection mechanism). Russia's blocking regime has been more volatile in the measurement period, with blocks appearing, disappearing, and reappearing for the same domains — which produces higher anomaly rate variance and therefore more recurrent priority boosts and more adaptive injections.

Iran's 74 tasks per window is the highest in the network and reflects the breadth of active blocking: not just NEWS and SMG categories but HUMR, COMM, and ANON domains all showing active interference in recent measurement history. When the adaptive injection fires for multiple domains simultaneously — as it does during coordinated blocking events in IR — the per-probe task count can briefly spike above 90 before the 6-window urgent period expires and task counts normalize.

The 43-task figure for DE represents the lower bound of useful measurement. Even in a jurisdiction with essentially no censorship, measuring fewer than 40 domains per window would leave too many gaps in coverage for meaningful longitudinal tracking. The low-priority tier's 4-window cadence keeps every domain on a nominal 20-minute schedule regardless of its base priority, and the recency boost raises any domain that still slips past it, which sets a floor under the per-window task count in low-anomaly countries.

The probe that executes these measurement plans — Tauri 2, boringtun WireGuard, and why keys never leave the device: The Voidly probe: Tauri + boringtun network measurement at the operator's edge →

The 80-domain test list that feeds the priority computation: Voidly's URL test list: how we curate the domains that reveal internet censorship →

The control server whose baseline measurements the scheduler output is validated against: The Voidly control server: how we tell censorship from a bad network →

The anomaly classifier that generates the HighPrioritySignals fed back to the scheduler: The Voidly anomaly classifier: five interference classes and why we optimize for recall →

The real-time pipeline that receives classified anomalies and routes HighPrioritySignals back to the scheduler: Voidly's real-time event pipeline: from measurement anomaly to journalist alert in under 8 minutes →

For how the labeled training data behind the classifier is built from OONI — ingestion, alignment, and Snorkel label functions: Building Voidly's classifier training dataset from OONI: ingestion, alignment, and label generation →