Voidly probe health monitoring: how we detect and replace failing probe nodes

October 8, 2025 · 7 min read · AI Analytics

Censorship · Voidly · Infrastructure · Methodology

Voidly's 37+ probe nodes are volunteer-operated desktop machines scattered across 200 countries. Unlike server-side infrastructure, which sits in managed data centers with uptime SLAs, probe nodes can go dark at any time: the operator closes the app, their ISP has an outage, the machine runs out of disk space, or the probe process crashes. The monitoring problem would be straightforward if probe failure were the only thing that could cause a probe to stop sending data — but it's not. A censor can also cut a probe off from our control server, producing silence that looks identical to a crashed process. Getting this distinction wrong has a real cost: a probe failure misread as censorship generates a false alert; censorship misread as probe failure causes a genuine event to go undetected.

The core problem: probe failure vs. censorship blocking

When a probe stops sending measurements, there are two candidate explanations. First: the probe has a technical problem — it has crashed, the operator's machine is off, or the network is down for reasons unrelated to censorship. Second: the probe is in a country that is actively blocking outbound connections to our control server, preventing measurements from being submitted even though the probe process is healthy.

These two cases must be separated before we can act. If we treat every silent probe as a censorship signal, probe crashes in high-risk countries generate false alerts. If we treat every silent probe as a technical failure, an ISP that blocks our control server endpoint goes undetected. The health monitoring system is built around resolving this ambiguity, not just detecting silence.

The heartbeat system

The first instrument is the heartbeat: a lightweight status packet each probe sends every 60 seconds, completely independent of the measurement traffic. Heartbeats use a separate transport — HTTPS POST to a /heartbeat endpoint that is distinct from the measurement submission endpoint and hosted on a different IP. This separation matters: if the censor is blocking only the measurement ingestion endpoint, the heartbeat can still reach us and tell us the probe is alive but its measurements are being filtered.

Each heartbeat packet carries the following fields:

{
  "probe_id":            "prb_7f3a91c2",
  "probe_cc":            "IR",
  "probe_asn":           "AS44244",
  "software_version":    "2.3.1",
  "uptime_seconds":      86412,
  "queue_depth":         17,           // measurements pending upload
  "last_measurement_at": 1728396103   // Unix timestamp
}

The queue_depth and last_measurement_at fields are critical for diagnosis. A probe with a large and growing queue depth is running measurements but cannot upload them — the measurement path is blocked. A probe with a zero queue depth and a stale last_measurement_at has stopped producing measurements entirely, which points to a process-level or test-list problem rather than a network block.
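The two readings above can be sketched as a small classifier over consecutive heartbeats. This is a minimal illustration rather than Voidly's implementation: the `Heartbeat` class, the `diagnose_heartbeats` function, the returned labels, and the 30-minute staleness threshold are all assumptions made for the example.

```python
from dataclasses import dataclass

STALE_AFTER_SECONDS = 1800  # assumed threshold; the source does not specify one


@dataclass
class Heartbeat:
    received_at: int          # Unix timestamp when the heartbeat arrived
    queue_depth: int          # measurements pending upload
    last_measurement_at: int  # Unix timestamp of the newest measurement


def diagnose_heartbeats(previous: Heartbeat, current: Heartbeat) -> str:
    """Classify a live probe from two consecutive heartbeats."""
    queue_growing = current.queue_depth > previous.queue_depth
    stale = (current.received_at - current.last_measurement_at
             > STALE_AFTER_SECONDS)

    if current.queue_depth > 0 and queue_growing:
        # Measurements are being produced but are not being delivered.
        return "UPLOAD_PATH_BLOCKED"
    if current.queue_depth == 0 and stale:
        # Nothing queued and nothing recent: the probe has stopped measuring.
        return "NOT_MEASURING"
    return "HEALTHY"
```

Comparing two heartbeats rather than one is what makes "growing" observable; a single large queue_depth could also be a backlog that is already draining.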

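The probe state machine is driven by heartbeat timing. compute_probe_state() returns only the three timing-based states; FLAPPING is assigned by the flap tracker described in a later section, and INACTIVE is set manually: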

PROBE_STATES = {
    "ONLINE":    "heartbeat received within last 5 minutes",
    "DEGRADED":  "no heartbeat for 5–15 minutes",
    "OFFLINE":   "no heartbeat for > 15 minutes",
    "FLAPPING":  "3+ ONLINE↔OFFLINE transitions in a 2-hour window",
    "INACTIVE":  "manually retired or operator churned",
}

from datetime import datetime

def compute_probe_state(probe_id: str, now: datetime) -> str:
    last_hb = heartbeat_store.last_seen(probe_id)
    if last_hb is None:
        return "OFFLINE"          # no heartbeat has ever been recorded
    elapsed = (now - last_hb).total_seconds()
    if elapsed < 300:             # under 5 minutes
        return "ONLINE"
    if elapsed < 900:             # 5 to 15 minutes
        return "DEGRADED"
    return "OFFLINE"              # 15 minutes or more

Measurement quality scoring

A heartbeat tells us whether a probe process is alive — it does not tell us whether that probe is producing useful data. A probe can be heartbeating reliably while its measurements are systematically wrong: the test list version is stale, the probe's DNS resolver is poisoned, or the probe's ASN has such aggressive caching that DNS diversity is effectively zero. Quality scoring catches these cases.

Each probe receives a quality score from 0 to 100, computed every 30 minutes from the last 4 hours of measurements:

from dataclasses import dataclass

@dataclass
class ProbeHealthScore:
    probe_id:                   str
    measurement_rate:           float   # observed / expected measurements per hour, capped at 1.0
    error_rate:                 float   # fraction with measurement_error set (not None)
    control_reachability:       float   # fraction where >= 1 control node responded
    dns_response_diversity:     float   # distinct resolved IPs / measurements
    composite_score:            int     # 0–100


def compute_quality_score(probe_id: str, window_hours: int = 4) -> ProbeHealthScore:
    measurements = db.query(
        "SELECT * FROM measurements WHERE probe_id = %s "
        "AND ts > NOW() - INTERVAL %s HOUR",
        (probe_id, window_hours)
    )
    if not measurements:
        return ProbeHealthScore(probe_id=probe_id, measurement_rate=0,
                                error_rate=1, control_reachability=0,
                                dns_response_diversity=0, composite_score=0)

    expected_rate = PROBE_TYPE_EXPECTED_RATE[probe_config(probe_id).type]
    rate_score = min(1.0, (len(measurements) / window_hours) / expected_rate)

    errors = [m for m in measurements if m.measurement_error is not None]
    error_score = 1.0 - (len(errors) / len(measurements))

    control_ok = [m for m in measurements if m.control_nodes_reached >= 1]
    control_score = len(control_ok) / len(measurements)
    # Probes where < 50% of measurements can reach any control node
    # cannot be verified — their data is untrustworthy regardless of volume.
    if control_score < 0.5:
        return ProbeHealthScore(probe_id=probe_id,
                                measurement_rate=rate_score,
                                error_rate=1 - error_score,
                                control_reachability=control_score,
                                dns_response_diversity=0,
                                composite_score=max(0, int(control_score * 20)))

    resolved_ips = [m.dns_resolved_ip for m in measurements if m.dns_resolved_ip]
    diversity = len(set(resolved_ips)) / len(resolved_ips) if resolved_ips else 0

    composite = int(
        rate_score    * 30 +
        error_score   * 25 +
        control_score * 30 +
        diversity     * 15
    )
    return ProbeHealthScore(probe_id=probe_id,
                            measurement_rate=rate_score,
                            error_rate=1 - error_score,
                            control_reachability=control_score,
                            dns_response_diversity=diversity,
                            composite_score=composite)

Probes with a composite score below 40 are flagged for operator review. Probes below 20 are automatically suspended from contributing to anomaly scoring until the score recovers across a full 4-hour window.
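As a worked example of the weights and thresholds, the sketch below recomputes the composite from its four components and maps a score to an action. The function names and the action labels ("OK", "REVIEW", "SUSPEND") are hypothetical; only the 30/25/30/15 weights and the 40 and 20 cut-offs come from the text above.

```python
def composite_from_components(rate: float, error: float,
                              control: float, diversity: float) -> int:
    """Weighted sum with the 30/25/30/15 weights; each input is in [0, 1]."""
    # int() truncates toward zero, matching the listing above.
    return int(rate * 30 + error * 25 + control * 30 + diversity * 15)


def score_action(composite_score: int) -> str:
    """Map a composite score to the action described in the text."""
    if composite_score < 20:
        return "SUSPEND"   # excluded from anomaly scoring
    if composite_score < 40:
        return "REVIEW"    # flagged for operator review
    return "OK"


# Full measurement rate, half of measurements error-free, 75% control
# reachability, no DNS diversity: 30 + 12.5 + 22.5 + 0 = 65.
example = composite_from_components(1.0, 0.5, 0.75, 0.0)
```

One consequence of the early return in compute_quality_score: a probe with control reachability below 0.5 scores at most int(0.5 * 20) - 1 = 9, so it always lands in the suspended band regardless of its measurement volume.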

ASN coverage SLOs

Individual probe health is necessary but not sufficient. What matters for incident detection is whether each country maintains the ASN diversity required to reach a corroborated or verified confidence tier. We define per-country coverage SLOs based on two tiers:

Standard coverage. Any country with a population above 1 million must have at least 2 active, ONLINE probes on distinct ASNs at all times.
High-risk coverage. Countries in our elevated-monitoring tier (CN, RU, IR, BY, VN, ET, PK, NG) must maintain at least 4 active probes on distinct ASNs. These are the countries where we most need corroboration to rule out false positives from individual probe compromise.

The SLO check runs every 5 minutes and fires an alert to the operations queue when a country falls below threshold:

HIGH_RISK_COUNTRIES = {"CN", "RU", "IR", "BY", "VN", "ET", "PK", "NG"}
STANDARD_MIN_PROBES  = 2
HIGH_RISK_MIN_PROBES = 4

def check_coverage_slos(now: datetime) -> list[CoverageAlert]:
    alerts = []
    for country in all_monitored_countries():
        # Skip small countries before doing any per-probe work.
        if country_population(country) < 1_000_000:
            continue  # below population threshold, no SLO

        # A probe counts toward coverage only if it is ONLINE and its
        # quality score is at or above the operator-review threshold (40).
        active_probes = [
            p for p in probes_by_country(country)
            if compute_probe_state(p.probe_id, now) == "ONLINE"
            and compute_quality_score(p.probe_id).composite_score >= 40
        ]
        distinct_asns = len({p.asn for p in active_probes})
        required = (HIGH_RISK_MIN_PROBES
                    if country in HIGH_RISK_COUNTRIES
                    else STANDARD_MIN_PROBES)
        if distinct_asns < required:
            alerts.append(CoverageAlert(
                country=country,
                distinct_asns=distinct_asns,
                required=required,
                deficit=required - distinct_asns,
            ))
    return alerts

Flapping detection

A probe that oscillates between ONLINE and OFFLINE is harder to handle than a probe that simply goes dark. Flapping can mean several things: the probe operator's home network has intermittent connectivity unrelated to censorship; the probe's ISP is intermittently blocking the heartbeat or measurement endpoints, which is itself a censorship-adjacent signal; or the probe process is crashing and restarting in a loop.

Flap detection triggers when a probe makes three or more ONLINE-to-OFFLINE or OFFLINE-to-ONLINE transitions within any rolling 2-hour window. Once flagged as FLAPPING, the probe's measurements continue to be collected but cannot contribute to a VERIFIED confidence tier — the instability means we cannot rule out partial data collection. The maximum confidence tier for a flapping probe's measurements is CORROBORATED, and all records from that probe carry probe_flapping: true in the dataset.

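from collections import deque
from datetime import datetime, timedelta

FLAP_WINDOW_SECONDS = 7200   # 2 hours
FLAP_THRESHOLD      = 3      # transitions to trigger FLAPPING state

class ProbeStateTracker:
    def __init__(self):
        self._state: dict[str, str] = {}       # reported state (may be FLAPPING)
        self._last_hard: dict[str, str] = {}   # last ONLINE/OFFLINE observation
        self._transitions: dict[str, deque] = {}

    def state(self, probe_id: str) -> str:
        return self._state.get(probe_id, "UNKNOWN")

    def record_state(self, probe_id: str, new_state: str, ts: datetime) -> str:
        transitions = self._transitions.setdefault(probe_id, deque())

        # Compare against the last ONLINE/OFFLINE observation, not the reported
        # state, so passing through DEGRADED or FLAPPING does not hide a transition.
        if new_state in ("ONLINE", "OFFLINE"):
            prev = self._last_hard.get(probe_id)
            if prev is not None and prev != new_state:
                transitions.append(ts)
            self._last_hard[probe_id] = new_state

        # Prune transitions outside the 2-hour window
        cutoff = ts - timedelta(seconds=FLAP_WINDOW_SECONDS)
        while transitions and transitions[0] < cutoff:
            transitions.popleft()

        # Cooldown: a flapping probe stays FLAPPING until it is ONLINE with
        # no transitions left in the window.
        was_flapping = self._state.get(probe_id) == "FLAPPING"
        cooled_down = new_state == "ONLINE" and not transitions
        if len(transitions) >= FLAP_THRESHOLD or (was_flapping and not cooled_down):
            self._state[probe_id] = "FLAPPING"
        else:
            self._state[probe_id] = new_state

        return self._state[probe_id]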

A probe exits FLAPPING state only after it has maintained a continuous ONLINE state for the full 2-hour flap window with zero additional transitions. This cooldown prevents a probe from briefly stabilizing, shedding the FLAPPING label, and immediately flapping again.
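The cooldown rule can be expressed as a single predicate. The sketch below is illustrative: the name `can_exit_flapping` is hypothetical, and it assumes the caller knows the timestamp of the probe's most recent ONLINE/OFFLINE transition, which for a currently ONLINE probe is the moment it last came back online.

```python
from datetime import datetime, timedelta

FLAP_WINDOW = timedelta(hours=2)


def can_exit_flapping(current_state: str,
                      last_transition_at: datetime,
                      now: datetime) -> bool:
    """True once the probe has been ONLINE, transition-free, for a full window."""
    if current_state != "ONLINE":
        # Being stable but OFFLINE does not clear the label.
        return False
    return now - last_transition_at >= FLAP_WINDOW
```

Any new transition resets `last_transition_at`, so the two-hour clock restarts each time the probe drops and recovers.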

Automated replacement

When a probe has been OFFLINE for more than 2 hours, its country's coverage SLO is violated, and that country has a waitlist of candidate operators, the system initiates an automated replacement. Waitlisted candidates are operators who have already completed vetting — they signed the operator agreement and installed the probe binary — but whose country had sufficient active coverage at the time they applied, so they were held in standby.

The automated replacement flow sends an activation email to the next eligible candidate on the country's waitlist. “Eligible” means the candidate is on a distinct ASN from all currently active probes in that country — activating a second candidate on the same ASN as an already-active probe does not improve coverage. The activation sets the probe status from STANDBY to ACTIVE and adds the probe's ID to the active measurement list so the collector begins accepting its submissions.

def attempt_automated_replacement(country: str, offline_probe_id: str) -> bool:
    active_asns = {
        p.asn for p in probes_by_country(country)
        if probe_tracker.state(p.probe_id) == "ONLINE"
    }
    candidates = waitlist.get_candidates(
        country=country,
        exclude_asns=active_asns,   # must be on a new ASN
        status="STANDBY",
    )
    if not candidates:
        ops_queue.enqueue(CoverageAlert(country=country, action="manual_recruit"))
        return False

    next_candidate = candidates[0]
    probe_registry.set_status(next_candidate.probe_id, "ACTIVE")
    measurement_list.add(next_candidate.probe_id)
    email.send_activation(next_candidate.operator_contact, next_candidate.probe_id)
    audit_log.record("auto_replacement", triggered_by=offline_probe_id,
                     replaced_by=next_candidate.probe_id, country=country)
    return True

This is the self-healing mechanism of the probe network: the pool of vetted standby operators acts as a reserve that automatically activates when coverage degrades. In practice, automated replacement resolves roughly 60% of SLO violations without manual intervention.

Distinguishing probe failure from censorship blocking

The key diagnostic rule is: if a probe's heartbeat is down, it is almost certainly a technical failure rather than censorship. Censorship that is severe enough to block measurement delivery will usually not also block a heartbeat sent to a different IP on a different HTTP endpoint. The heartbeat exists precisely to survive this scenario.

The exception is a highly restrictive network that blocks all outbound HTTPS, not just specific endpoints. This is rare — it would break ordinary web browsing for the operator — but it exists in some enterprise and government-controlled environments. In these cases, both heartbeat and measurements go silent simultaneously, which looks identical to a process crash.
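The two-channel reasoning can be laid out as a small decision table. This is a sketch under stated assumptions: the function name and labels are hypothetical, and the case where measurements arrive without a heartbeat is not discussed in the text and is included only for completeness.

```python
def triage_silence(heartbeat_alive: bool, measurements_arriving: bool) -> str:
    """First-pass triage from the two independent channels."""
    if heartbeat_alive and measurements_arriving:
        return "HEALTHY"
    if heartbeat_alive:
        # Probe process is running but its submissions are not reaching us.
        return "MEASUREMENT_PATH_BLOCKED"
    if measurements_arriving:
        # Assumed case: data flows but the heartbeat endpoint is unreachable.
        return "HEARTBEAT_PATH_BLOCKED"
    # Both channels silent: most likely a technical failure, but a network
    # that blocks all outbound HTTPS produces exactly the same silence.
    return "LIKELY_TECHNICAL_FAILURE"
```

The last branch is the one that cannot be settled from the probe's own signals, because a crash and a total outbound block are indistinguishable from the inside.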

The classify_offline_cause() function resolves the ambiguity by consulting peer probe data:

def classify_offline_cause(probe_id: str) -> OfflineCause:
    probe = probe_registry.get(probe_id)
    country, asn = probe.country, probe.asn

    # Check OONI probes in the same ASN: do they also show anomalies?
    ooni_same_asn = ooni_client.recent_failures(asn=asn, hours=2)
    asn_failing = ooni_same_asn.failure_rate > 0.5

    # Check other Voidly probes in the same country: are they still healthy?
    peers = [p for p in probes_by_country(country) if p.probe_id != probe_id]
    online_peers = [
        p for p in peers
        if probe_tracker.state(p.probe_id) == "ONLINE"
    ]

    if not online_peers:
        # No healthy peers in the country to compare against
        return OfflineCause.UNKNOWN

    if len(online_peers) == len(peers):
        # All other country probes are healthy → localized to this probe's ASN or device
        if asn_failing:
            return OfflineCause.POSSIBLE_ASN_BLOCK
        return OfflineCause.PROBE_TECHNICAL_FAILURE

    if asn_failing and any(p.asn != asn for p in online_peers):
        # Some peers are down and OONI probes in the same ASN are also failing,
        # but other-ASN probes in the same country are healthy → ISP-level block
        return OfflineCause.POSSIBLE_ISP_CENSORSHIP

    # Multiple probes offline in the same country simultaneously →
    # country-wide event, possibly coordinated blocking or national outage
    return OfflineCause.POSSIBLE_COUNTRY_BLOCK

The result of this classification is not itself a censorship alert — it is a label that feeds into the incident clustering pipeline. A POSSIBLE_ISP_CENSORSHIP or POSSIBLE_COUNTRY_BLOCK cause opens an investigation thread; a PROBE_TECHNICAL_FAILURE cause routes to the operations queue for operator outreach.
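A minimal sketch of that routing step follows. The enum members mirror the labels used in classify_offline_cause(), but their string values, the `route_offline_cause` name, and the destination names are hypothetical. The text does not say where POSSIBLE_ASN_BLOCK and UNKNOWN go, so sending them to manual triage is an assumption.

```python
from enum import Enum


class OfflineCause(Enum):
    PROBE_TECHNICAL_FAILURE = "probe_technical_failure"
    POSSIBLE_ASN_BLOCK = "possible_asn_block"
    POSSIBLE_ISP_CENSORSHIP = "possible_isp_censorship"
    POSSIBLE_COUNTRY_BLOCK = "possible_country_block"
    UNKNOWN = "unknown"


# Causes that the text says open an investigation thread.
INVESTIGATION_CAUSES = {
    OfflineCause.POSSIBLE_ISP_CENSORSHIP,
    OfflineCause.POSSIBLE_COUNTRY_BLOCK,
}


def route_offline_cause(cause: OfflineCause) -> str:
    """Return the destination for a classified offline cause."""
    if cause in INVESTIGATION_CAUSES:
        return "investigation_thread"
    if cause is OfflineCause.PROBE_TECHNICAL_FAILURE:
        return "operations_queue"
    # Assumption: causes the text does not route explicitly get a human look.
    return "manual_triage"
```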

Common failure modes and the operations runbook

Five failure modes together account for over 80% of probe incidents. The table below lists each with its standard diagnosis and resolution:

Failure mode	Diagnosis signal	Resolution
Software version mismatch (not updated in >30 days)	`software_version` in heartbeat is >2 minor releases behind current	Automated email to operator with one-click update link; escalates to ops queue if no update after 72 hours
NAT / firewall blocking WireGuard UDP port	`queue_depth` growing, heartbeat healthy, WireGuard handshake timeout in probe logs	Email operator with port-forwarding instructions; offer fallback TCP encapsulation mode
Operator ISP blocking control server domain	`control_reachability` < 0.2, `classify_offline_cause` returns `POSSIBLE_ASN_BLOCK`	Activate replacement probe on different ASN; flag the block as a potential censorship signal for investigation
Probe process crash loop (affects versions <2.1.0)	Heartbeat uptime resets to 0 every few minutes; `software_version` confirms affected release	Force-update via push notification to probe binary update channel; known memory leak patched in 2.1.0
Operator churned (machine decommissioned or moved)	OFFLINE for >7 days, no response to two outreach emails, no activity in probe portal	Mark probe INACTIVE, release the ASN slot in coverage accounting, open a waitlist position in that country
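The first row's diagnosis signal, a software_version more than two minor releases behind, can be checked with a simple comparison. The sketch below assumes versions follow a MAJOR.MINOR.PATCH scheme like the "2.3.1" in the heartbeat example; the function name is hypothetical, and treating any older major version as outdated is an assumption the table does not state.

```python
def is_version_outdated(probe_version: str, current_version: str,
                        max_minor_lag: int = 2) -> bool:
    """True if the probe is more than max_minor_lag minor releases behind."""
    probe_major, probe_minor = (int(x) for x in probe_version.split(".")[:2])
    cur_major, cur_minor = (int(x) for x in current_version.split(".")[:2])

    if probe_major < cur_major:
        return True    # assumption: an older major version is always outdated
    if probe_major > cur_major:
        return False   # probe is ahead of the reference version
    return cur_minor - probe_minor > max_minor_lag
```

For example, a probe reporting "2.3.1" against a current release of "2.6.0" is three minor releases behind and would be flagged, while the same probe against "2.5.4" would not.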

Monitoring infrastructure

The probe health dashboard runs as a Cloudflare Worker pulling heartbeat records from D1, Cloudflare's edge SQLite database. The worker renders a per-country status grid at status.voidly.ai, showing each probe's current state, quality score, and last heartbeat timestamp. The page updates every 60 seconds on a client-side refresh; the data source is the same D1 table that the heartbeat ingestion endpoint writes to, so there is no intermediate aggregation layer that can go stale.

Heartbeat ingestion latency — the time from the probe sending a packet to the timestamp appearing in D1 — runs at p50 120ms and p99 800ms, matching the measurement pipeline latency. Both pipelines share the same Cloudflare Workers infrastructure and the same D1 database; the only difference is the ingestion endpoint and the destination table.

Alerting uses a PagerDuty webhook. Two alert conditions fire at the PagerDuty tier:

Coordinated offline alert. Two or more probes in a high-risk country (the 8-country tier above) go OFFLINE within the same 10-minute window. This pattern is consistent with a coordinated blocking event or a national network outage, and requires immediate human review.
Coverage SLO breach for high-risk country. Any high-risk country drops below 4 distinct-ASN active probes, and automated replacement has either failed or has no waitlisted candidates to activate.
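The coordinated offline condition can be sketched as a sliding-window check over the times at which a country's probes went OFFLINE. The function name and input shape are hypothetical, and reading "the same 10-minute window" as any rolling 10-minute span, rather than fixed clock-aligned buckets, is an assumption.

```python
from datetime import datetime, timedelta

COORDINATION_WINDOW = timedelta(minutes=10)
MIN_PROBES = 2


def is_coordinated_offline(offline_at: dict[str, datetime]) -> bool:
    """offline_at maps probe_id to the time that probe went OFFLINE,
    for the probes of a single high-risk country."""
    times = sorted(offline_at.values())
    # After sorting, any MIN_PROBES events that fit in one window are adjacent.
    for i in range(len(times) - MIN_PROBES + 1):
        if times[i + MIN_PROBES - 1] - times[i] <= COORDINATION_WINDOW:
            return True
    return False
```

Sorting first keeps the check linear after the sort, and raising MIN_PROBES would tighten the condition without changing the loop.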

Lower-severity events — standard-tier SLO breaches, single-probe quality degradations, flapping probes — go into a Slack operations channel on a 15-minute digest. The operations team triages this channel during business hours; the PagerDuty alerts are the only ones that produce an on-call page outside of working hours.

For the probe application that generates the measurements being monitored: The Voidly Probe: Tauri + boringtun network measurement at the operator's edge →

For how the control server comparison distinguishes censorship from ordinary network failures: The Voidly control server: how we tell censorship from a bad network →

For the criteria that determine which ASNs and countries receive probes in the first place: Voidly probe vantage selection: ASN diversity, operator safety, and reaching hard-to-measure countries →

For how the probe measures bandwidth throttling — the TimingFeatures struct, TTFB z-score against control, body truncation signals, and the congestion vs. deliberate-throttling calibration problem: How Voidly measures bandwidth throttling: timing signals, body truncation, and the calibration problem →

For how probes detect DNS injection — forged A records, TTL anomalies, source IP divergence, and per-country injection rates: How Voidly detects DNS injection: forged responses, injection rates by country, and pipeline integration →