Voidly's country-level censorship score: aggregating 2.2B probe measurements into the global index

July 8, 2025 · 9 min read · AI Analytics

Censorship, Voidly, Methodology, Data engineering

Voidly generates roughly 100,000 probe measurements per day, spread across 200 countries and more than 37 vantage points. Each measurement produces a set of per-class interference probabilities from the anomaly classifier (prob_dns_tampering, prob_tls_interference, prob_http_blocking, prob_bgp_withdrawal, and prob_throttling), along with metadata about the probe, the target domain, and which external sources corroborate the observation.

The global censorship rankings published at voidly.ai collapse all of that into a single number per country: censorship_score, a value between 0.0 and 1.0 where 1.0 represents pervasive, multi-vector, cross-source-verified blocking across all domain categories. This post explains the computation behind that number — every weighting factor, the rolling window, the smoothing pass, and how uncertainty bounds are derived.

The aggregation problem

Naive averaging doesn't work. Consider a country where 80% of all probe measurements originate from a single mobile operator, a common situation in markets where mobile internet dominates and fixed-line broadband penetration is low. If that operator implements aggressive DNS-level blocking (and many do, often for commercial rather than political reasons), the country's raw average interference probability would be dominated by one ISP's policy. A country could score 0.7 purely because a single mobile operator blocks adult content, while a more ASN-diverse country with national-level political censorship scores 0.3 because its blocks are spread thinly across many probes.

There are four distinct biases the aggregation needs to handle:

ASN concentration bias. Measurements from a single ISP are not independent. If one ISP blocks, every probe on that ISP fires — but that's still one blocking policy, not N independent observations of a national pattern.
Domain category bias. A country that blocks adult content uniformly is not meaningfully more censored than one that doesn't, from a press-freedom standpoint. Category weighting ensures that the score reflects the kinds of censorship that matter for human rights and political freedom.
Recency bias. A censorship event that ended six months ago should not contribute equally to a country's current score. The score should reflect the country's present-day censorship environment, not its history.
Verification bias. An uncorroborated classifier output is worth less than one confirmed by OONI, CensoredPlanet, or IODA. Weighting by corroboration status ensures that uncertain detections don't dominate the score when cross-source data is available.

The composite formula addresses all four simultaneously, which is what makes it substantially more complex than a plain unweighted average.

What goes into a measurement's contribution

The anomaly classifier emits five per-class probabilities for each measurement. The per-measurement interference probability, which we call prob_any_interference, is the complement of the probability that none of the three content-level interference types (DNS tampering, HTTP blocking, TLS interference) is active:

prob_any_interference = 1 - (
    (1 - prob_dns_tampering)
    * (1 - prob_http_blocking)
    * (1 - prob_tls_interference)
)

# Note: BGP withdrawal and throttling are excluded from this combination
# because they are handled as separate event types in the incident system.
# They do influence the country score via their own confidence_tier promotion,
# but are not folded into prob_any_interference to avoid double-counting.

This is the complement rule for the union of independent events (equivalent to expanding the union by inclusion-exclusion), and it avoids the double-counting that comes from summing. In practice DNS tampering, HTTP blocking, and TLS interference are not fully independent (a government can deploy all three on the same target). When the three are positively correlated, the independence assumption overstates the probability of the union, so the result is an upper bound on the true interference probability rather than an exact value. That direction of bias is consistent with the recall-heavy classifier design.
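
As a quick illustration, the combination can be wrapped in a small helper. The function name and the example probabilities below are illustrative assumptions, not values taken from the Voidly pipeline.

```python
def prob_any_interference(p_dns: float, p_http: float, p_tls: float) -> float:
    """Probability that at least one of the three interference types is
    present, assuming the three are independent."""
    return 1.0 - (1.0 - p_dns) * (1.0 - p_http) * (1.0 - p_tls)

# Hypothetical classifier outputs for a single measurement.
p = prob_any_interference(0.60, 0.20, 0.10)
print(round(p, 3))  # 1 - 0.4 * 0.8 * 0.9 = 0.712

# The combined value always sits between the largest single probability
# and the (capped) sum of the three.
assert max(0.60, 0.20, 0.10) <= p <= min(1.0, 0.60 + 0.20 + 0.10)
```

The combined value is larger than any single class probability but smaller than their sum, which is the point of combining through the complement instead of adding.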

Only measurements with confidence_tier ≥ corroborated feed into the country score computation. Raw observed-tier measurements are excluded: they represent single-source, unconfirmed classifier outputs that have not yet been validated by external sources. Requiring at least the corroborated tier substantially reduces noise at the cost of some coverage in data-sparse countries.

Recency decay

Each measurement's contribution decays exponentially with age. The half-life is 30 days: a measurement that is 30 days old contributes half as much as an identical measurement made today; one that is 60 days old contributes a quarter; one that is 90 days old contributes 12.5%. Measurements older than 90 days are excluded from the rolling window entirely.

import math

HALF_LIFE_DAYS = 30
LAMBDA = math.log(2) / HALF_LIFE_DAYS  # ≈ 0.02310

def recency_weight(age_days: float) -> float:
    """
    Exponential decay weight for a measurement aged age_days days.
    Returns 0.0 for measurements older than 90 days.
    """
    if age_days > 90:
        return 0.0
    return math.exp(-LAMBDA * age_days)

# Spot-check
# age=0  → weight=1.000
# age=30 → weight=0.500  (one half-life)
# age=60 → weight=0.250
# age=90 → weight=0.125  (threshold before cutoff)

The 30-day half-life is intentionally short. It means the country score is dominated by the trailing month, with the 31–90-day window providing a slowly decaying tail that prevents the score from swinging violently when coverage fluctuates. A country that was heavily censored two months ago but relaxed restrictions last week will show a declining score, not a stable one anchored to historical conditions.
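
To put a number on how much the trailing month dominates, the sketch below sums the decay weights under one simplifying assumption that is ours, not Voidly's: exactly one measurement per day for ages 0 through 90. Real coverage is uneven, so the actual share will differ.

```python
import math

HALF_LIFE_DAYS = 30
LAMBDA = math.log(2) / HALF_LIFE_DAYS

def recency_weight(age_days: float) -> float:
    # Same rule as in the post: exponential decay, hard cutoff after 90 days.
    if age_days > 90:
        return 0.0
    return math.exp(-LAMBDA * age_days)

# Assumption: one measurement per day, ages 0..90 days inclusive.
ages = range(0, 91)
total = sum(recency_weight(a) for a in ages)
trailing_30 = sum(recency_weight(a) for a in ages if a < 30)

share_trailing_30 = trailing_30 / total
print(f"share of total weight from the last 30 days: {share_trailing_30:.1%}")
```

Under that uniform-sampling assumption, a little over half of the total weight comes from the most recent 30 days, with the remainder spread across the 31 to 90 day tail.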

The 90-day cutoff is a rolling window parameter, not a philosophical boundary. The shutdown forecasting model requires a longer historical signal and operates on a separate 365-day dataset. The country score is explicitly a present-state indicator, not a historical record.

ASN diversity weighting

ASN weighting is the most counterintuitive component of the formula. The goal is to ensure that a country with 1 active ASN and a country with 20 active ASNs are scored based on the breadth of their blocking, not just the volume of measurements.

For each measurement m on autonomous system A in a country window with K unique active ASNs, the ASN weight is:

def asn_weight(K: int) -> float:
    """
    Per-measurement ASN weight for a country with K unique active ASNs
    in the rolling window.

    All measurements on the same ASN receive the same weight, regardless
    of how many measurements there are on that ASN. This means a country
    with 5 active ASNs gets each ASN weighted as 1/√5 ≈ 0.447, while a
    country with 1 active ASN gets weight 1/√1 = 1.0.
    """
    return 1.0 / math.sqrt(max(1, K))

Critically, this weight is uniform across all probes on ASN A — it does not scale with the number of probes on that ASN. If 5,000 measurements in the window come from a single dominant mobile operator in a country with 3 active ASNs, all 5,000 measurements get weight 1/√3 ≈ 0.577. The 5,000 probes represent one ASN's blocking policy, not 5,000 independent signals.

The square root in the denominator is a deliberate middle ground. It discounts less aggressively than 1/K (which would weight each ASN as equal regardless of measurement count) but more aggressively than no discounting at all. In practice, for markets with 5 active ASNs, each ASN contributes at ≈0.447 weight; at 10 ASNs, ≈0.316; at 20 ASNs, ≈0.224. The weight falls off as the inverse square root of the number of ASNs, so the marginal effect of additional ASN diversity diminishes, which matches our empirical finding that the difference between 15 and 20 active ASNs is less meaningful than the difference between 1 and 5.
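
The values quoted above can be reproduced with a few lines. The comparison column for 1/K is included only to show the alternative the square root is being contrasted with.

```python
import math

def asn_weight(K: int) -> float:
    """Per-measurement ASN weight for a country with K active ASNs."""
    return 1.0 / math.sqrt(max(1, K))

for K in (1, 3, 5, 10, 15, 20):
    print(f"K={K:>2}  1/sqrt(K)={asn_weight(K):.3f}  1/K={1.0 / K:.3f}")

# Diminishing effect of extra diversity: the step from 1 to 5 ASNs moves
# the weight far more than the step from 15 to 20.
step_1_to_5 = asn_weight(1) - asn_weight(5)
step_15_to_20 = asn_weight(15) - asn_weight(20)
assert step_1_to_5 > step_15_to_20
```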

Domain category weighting

Not all blocked domains contribute equally to the censorship score. Blocking an adult content site is commercially common, legally routine in many jurisdictions, and carries little signal about political censorship. Blocking a news media site or a VPN circumvention tool is a direct signal of information control.

Each measurement is tagged with its target domain's category (derived from the test list metadata and augmented with our own classifier). The category weight multiplies the measurement's contribution:

CATEGORY_WEIGHTS: dict[str, float] = {
    'news_media':        2.0,   # highest signal — directly targets information access
    'social_media':      1.8,   # platforms used for political organizing
    'messaging':         1.8,   # encrypted messaging apps (Signal, WhatsApp, etc.)
    'political_content': 1.6,   # opposition sites, electoral information
    'human_rights':      1.6,   # NGOs, human rights orgs, documentation sites
    'vpn_circumvention': 1.5,   # tools for bypassing censorship infrastructure
    'lgbtq':             1.4,   # targeted in many censorship regimes
    'religious':         1.2,   # selectively targeted in theocratic contexts
    'adult_content':     0.8,   # common commercial/legal block — lower signal
    'gaming':            0.5,   # least signal — often bandwidth management
    'other':             1.0,   # baseline for uncategorized domains
}

def category_weight(domain_category: str) -> float:
    return CATEGORY_WEIGHTS.get(domain_category, 1.0)

The weights were calibrated against a ground-truth set of known-censorship countries: the formula should rank Iran, China, and Turkmenistan near the top of the global index regardless of which domain categories their probes happen to measure most frequently. Countries that block only adult content (common in the Gulf states for commercial rather than political reasons) should score lower than countries that block news media and political content even if the raw measurement counts are similar.

One subtlety: the vpn_circumvention category is particularly informative because blocking circumvention tools is itself evidence that the government is aware its other blocks are being bypassed and is trying to stop that. It's a meta-signal: blocking VPNs implies there is something to circumvent.
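
A small sketch of how category weighting separates two countries with the same raw blocking rate. The two measurement pools are invented for illustration, and recency, ASN, and corroboration terms are left out so that only the category effect is visible.

```python
CATEGORY_WEIGHTS = {
    "news_media": 2.0,
    "adult_content": 0.8,
    "other": 1.0,
}

def category_weighted_rate(rows: list[tuple[str, float]]) -> float:
    """rows: (domain_category, prob_any_interference) pairs."""
    num = sum(CATEGORY_WEIGHTS.get(cat, 1.0) * p for cat, p in rows)
    den = sum(CATEGORY_WEIGHTS.get(cat, 1.0) for cat, _ in rows)
    return num / den

# Hypothetical pools: one measurement per category in each country.
blocks_adult_only = [("news_media", 0.05), ("adult_content", 0.95), ("other", 0.05)]
blocks_news_only = [("news_media", 0.95), ("adult_content", 0.05), ("other", 0.05)]

# Both pools have the same unweighted mean interference probability.
unweighted_a = sum(p for _, p in blocks_adult_only) / 3
unweighted_b = sum(p for _, p in blocks_news_only) / 3

print(round(category_weighted_rate(blocks_adult_only), 3))
print(round(category_weighted_rate(blocks_news_only), 3))
```

With identical unweighted means, the country that blocks news media ends up with the higher weighted rate, which is the ordering the calibration is meant to produce.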

Cross-source corroboration multiplier

When an independent source — OONI, CensoredPlanet, or IODA — corroborates a Voidly measurement, the measurement's contribution to the country score is amplified. This reflects the epistemically stronger claim: two independent measurement pipelines saw the same interference, which substantially reduces the probability that it was a false positive.

Each measurement has a corroboration_score field computed by the cross-source reconciler: a value in [0.0, 1.0] that reflects how strongly external sources confirm the interference. The contribution multiplier is:

def corroboration_multiplier(corroboration_score: float) -> float:
    """
    Amplification factor for cross-source corroboration.
    A fully corroborated measurement (score=1.0) contributes 2x as much
    as an uncorroborated measurement (score=0.0), which contributes 1x.
    """
    return 1.0 + corroboration_score

# Examples:
# corroboration_score = 0.0  → multiplier = 1.0 (no external confirmation)
# corroboration_score = 0.5  → multiplier = 1.5 (partial corroboration)
# corroboration_score = 1.0  → multiplier = 2.0 (fully corroborated)

In practice, for countries where OONI has strong coverage, the majority of measurements in the corroborated tier have a nonzero corroboration score. Countries where external source coverage is thin (many sub-Saharan African markets, for instance) have most measurements clustered at corroboration_score near 0, which is structurally honest: we are less certain about those measurements, and the score reflects that.

The composite formula

Combining all four components, the country score at time t over the 90-day rolling window is:

# score(country, t) =
#   Σ_m [ decay(m) × asn_weight(m) × category_weight(m)
#          × (1 + corroboration_score(m))
#          × prob_any_interference(m) ]
#   ─────────────────────────────────────────────────────
#   Σ_m [ decay(m) × asn_weight(m) × category_weight(m)
#          × (1 + corroboration_score(m)) ]
#
# where the sum Σ_m is over all measurements in the 90-day window
# with confidence_tier >= corroborated.

from datetime import date

def compute_country_score(
    measurements: list[dict],
    reference_date: date,
) -> tuple[float, float, float]:
    """
    Returns (score, score_lower_90, score_upper_90).
    measurements: list of dicts with fields:
        measured_at, asn, domain_category, corroboration_score,
        prob_dns_tampering, prob_http_blocking, prob_tls_interference
    """
    K = len({m['asn'] for m in measurements})  # unique active ASNs

    weighted_sum = 0.0
    weight_sum = 0.0

    for m in measurements:
        age_days = (reference_date - m['measured_at'].date()).days
        d = recency_weight(age_days)
        if d == 0.0:
            continue

        a = asn_weight(K)
        c = category_weight(m['domain_category'])
        corr = 1.0 + m['corroboration_score']

        p = 1.0 - (
            (1.0 - m['prob_dns_tampering'])
            * (1.0 - m['prob_http_blocking'])
            * (1.0 - m['prob_tls_interference'])
        )

        weight = d * a * c * corr
        weighted_sum += weight * p
        weight_sum += weight

    if weight_sum == 0.0:
        return (0.0, 0.0, 0.0)

    score = weighted_sum / weight_sum
    lower, upper = bootstrap_confidence_interval(measurements, reference_date, n=1000, ci=0.90)
    return (round(score, 4), round(lower, 4), round(upper, 4))

The normalizer in the denominator is the sum of all weights, which keeps the score in [0, 1]. Note that asn_weight(K) depends only on the country's ASN count K, so it takes the same value for every measurement in the window and cancels exactly between numerator and denominator. In the code as shown, the ASN term therefore leaves a country's score unchanged, and each ASN's influence remains proportional to its total recency-, category-, and corroboration-weighted measurement volume. Giving each ASN an equal say (for each ASN, compute the category-weighted, recency-weighted, corroboration-weighted average interference probability, then average those per-ASN scores) would require a weight that varies by ASN, for example one divided by that ASN's own measurement count.
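
To make the normalization concrete, the sketch below checks that a factor shared by every measurement drops out of the ratio, and contrasts the pooled score with a hypothetical variant that averages within each ASN first. Neither function is Voidly's production code: the function names, the toy measurement pool, and the per-ASN variant are all illustrative assumptions.

```python
from collections import defaultdict

Row = tuple[str, float, float]  # (asn, weight, prob_any_interference)

def pooled_score(rows: list[Row], shared_factor: float = 1.0) -> float:
    """Weighted mean over all rows. shared_factor multiplies every weight."""
    num = sum(shared_factor * w * p for _, w, p in rows)
    den = sum(shared_factor * w for _, w, _ in rows)
    return num / den

def per_asn_equal_score(rows: list[Row]) -> float:
    """Hypothetical variant: weighted mean inside each ASN, then a plain
    mean across ASNs, so every ASN counts once regardless of volume."""
    num: dict[str, float] = defaultdict(float)
    den: dict[str, float] = defaultdict(float)
    for asn, w, p in rows:
        num[asn] += w * p
        den[asn] += w
    per_asn = [num[a] / den[a] for a in num]
    return sum(per_asn) / len(per_asn)

# Toy pool: one dominant ASN that blocks heavily, two small ASNs that do not.
rows = ([("AS1", 1.0, 0.9)] * 80
        + [("AS2", 1.0, 0.1)] * 10
        + [("AS3", 1.0, 0.1)] * 10)

print(round(pooled_score(rows), 3))                  # dominated by AS1
print(round(pooled_score(rows, 1 / 3 ** 0.5), 3))    # same value: factor cancels
print(round(per_asn_equal_score(rows), 3))           # each ASN counted once
```

In the toy pool the pooled score tracks the dominant operator, while the per-ASN variant lands much lower because the two small ASNs carry the same say as the large one.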

Per-country baseline calibration

The formula above produces a score in [0, 1], but the absolute value depends on how many measurements are available and how representative the probed domains are of the country's actual blocking environment. China's score is anchored by 47 active ASNs and over 12 million measurements in the 90-day window. Eritrea's score is anchored by 3 active ASNs and approximately 200 measurements per month.

We do not apply a baseline shift to the score itself — we do not force China or Iran to score higher just because we know from context that they are heavily censored. The score is an empirical quantity derived from measurements, not a political judgment. What we do apply is a coverage-dependent uncertainty band:

def coverage_adjustment(n_measurements: int) -> dict:
    """
    Computes the coverage metadata attached to the API response.
    Countries with fewer than 500 measurements get a low_coverage flag
    and a wider confidence interval.
    """
    if n_measurements < 500:
        return {
            'low_coverage': True,
            'coverage_tier': 'sparse',
            'ci_width_multiplier': 2.0,  # doubled CI width
        }
    elif n_measurements < 5_000:
        return {
            'low_coverage': False,
            'coverage_tier': 'moderate',
            'ci_width_multiplier': 1.3,
        }
    else:
        return {
            'low_coverage': False,
            'coverage_tier': 'high',
            'ci_width_multiplier': 1.0,
        }

In practice, the widened uncertainty band for sparse-coverage countries is the most important calibration. A score of 0.45 ± 0.02 for Germany (high coverage) means something different from a score of 0.45 ± 0.18 for Eritrea (sparse coverage). The consumer of the API should treat the confidence interval as load-bearing, not decorative.
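
The post does not spell out how ci_width_multiplier is applied to the bootstrap interval. One plausible reading, sketched below as an assumption, is to scale each side of the interval about the point estimate and clamp the result to the valid score range. The function name and the example numbers are hypothetical.

```python
def widen_interval(
    score: float,
    lower: float,
    upper: float,
    multiplier: float,
) -> tuple[float, float]:
    """Scale each side of [lower, upper] about the point estimate by
    `multiplier`, then clamp to [0, 1]. Assumed behavior, not confirmed."""
    new_lower = score - multiplier * (score - lower)
    new_upper = score + multiplier * (upper - score)
    return max(0.0, new_lower), min(1.0, new_upper)

# Hypothetical sparse-coverage country: raw bootstrap CI of 0.45 +/- 0.09.
lo, hi = widen_interval(0.45, 0.36, 0.54, multiplier=2.0)
print(round(lo, 2), round(hi, 2))  # 0.27 0.63
```

Scaling each side separately keeps an asymmetric bootstrap interval asymmetric, and the clamp prevents a widened band from leaving [0, 1] for scores near either extreme.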

Temporal smoothing

The raw daily score computed by the formula above is noisy. Measurement coverage fluctuates: a probe node going offline for a day can reduce a country's measurement count by 20%, causing a spurious score dip that reverses the next day when the node comes back. Publishing raw daily scores would create a noisy time series that misleads users trying to track trends.

We apply a Gaussian kernel with σ = 3 days (about 7 days wide at half maximum) to the daily score time series before publishing. The kernel is truncated at ±3σ (a ±9-day window) and renormalized:

import numpy as np
from scipy.ndimage import gaussian_filter1d

def smooth_score_series(daily_scores: np.ndarray, sigma: float = 3.0) -> np.ndarray:
    """
    Apply a 1D Gaussian kernel to a daily score time series.
    sigma=3 days means 68% of the weight is within ±3 days of each point.
    NaN values (days with no measurements) are handled by linear interpolation
    before smoothing, then masked back afterward.
    """
    nan_mask = np.isnan(daily_scores)
    if nan_mask.all():
        return daily_scores

    # Interpolate over gaps before smoothing
    indices = np.arange(len(daily_scores))
    filled = np.interp(indices, indices[~nan_mask], daily_scores[~nan_mask])

    # truncate=3.0 cuts the kernel at ±3σ; SciPy's default is 4.0 (±4σ).
    smoothed = gaussian_filter1d(filled, sigma=sigma, mode='nearest', truncate=3.0)

    # Restore NaN for days that had no measurements at all (no coverage)
    smoothed[nan_mask] = np.nan
    return smoothed

Smoothing is applied to the score used for display on the voidly.ai dashboard and for the censorship_score field in the API response. The unsmoothed signal is retained separately and used by the event detection layer, which needs to see sharp spikes — the onset of a censorship event or an internet shutdown produces a step-function increase in the raw score that the Gaussian kernel would smear into a ramp. Event detection operates on the raw signal so that it can catch the onset within hours, not days.
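
For readers who want to see the kernel itself, the truncated and renormalized weights can be written out in plain Python. This is a sketch of the kernel described in this section, not the production smoothing code; the rounding used for the radius follows the convention we understand SciPy to use, which is an assumption here.

```python
import math

def gaussian_kernel(sigma: float = 3.0, truncate: float = 3.0) -> list[float]:
    """Discrete Gaussian weights for offsets -radius..+radius (in days),
    cut off at truncate * sigma and renormalized to sum to 1."""
    radius = int(truncate * sigma + 0.5)
    raw = [math.exp(-0.5 * (k / sigma) ** 2) for k in range(-radius, radius + 1)]
    total = sum(raw)
    return [w / total for w in raw]

kernel = gaussian_kernel()
center = len(kernel) // 2

print(len(kernel))                      # 19 taps: offsets -9..+9
print(round(sum(kernel), 6))            # 1.0 after renormalization
within_3_days = sum(kernel[center - 3:center + 4])
print(f"weight on offsets -3..+3: {within_3_days:.2f}")
```

Because the weights are renormalized after truncation, a constant score series passes through the filter unchanged, which is what keeps the smoothed score on the same [0, 1] scale as the raw one.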

The 30-day delta (censorship_score_30d_delta) is computed as the difference between the smoothed score today and the smoothed score 30 days ago. Positive deltas indicate worsening censorship; negative deltas indicate improvement.
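
A minimal sketch of the delta computation, assuming the smoothed series is a daily list ordered oldest to newest. Returning NaN when either endpoint is missing is our assumption; the post does not say how gaps are handled.

```python
import math

def delta_30d(smoothed: list[float]) -> float:
    """Latest smoothed score minus the smoothed score 30 days earlier.
    Returns NaN if the series is too short or either endpoint is NaN."""
    if len(smoothed) < 31:
        return math.nan
    today, month_ago = smoothed[-1], smoothed[-31]
    if math.isnan(today) or math.isnan(month_ago):
        return math.nan
    return today - month_ago

# Hypothetical series rising by 0.001 per day over 60 days.
series = [0.50 + 0.001 * day for day in range(60)]
print(round(delta_30d(series), 4))  # 0.03
```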

Uncertainty bounds

The 90% bootstrap confidence interval is computed by resampling the measurement pool with replacement 1,000 times and computing the score for each resample. The 5th and 95th percentiles of the resulting distribution are published as censorship_score_lower and censorship_score_upper:

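import numpy as np
from datetime import date

def bootstrap_confidence_interval(
    measurements: list[dict],
    reference_date: date,
    n: int = 1000,
    ci: float = 0.90,
) -> tuple[float, float]:
    """
    Percentile bootstrap CI (90% by default) over the weighted country score.
    Resamples at the measurement level, not the ASN level.

    The score for each resample is computed here from precomputed weights
    rather than by calling compute_country_score, which calls this function
    and would otherwise recurse without terminating.
    """
    # asn_weight(K) is identical for every measurement in a country and cancels
    # in the ratio, so using the full pool's K for every resample is equivalent.
    K = len({m['asn'] for m in measurements})
    w = np.array([
        recency_weight((reference_date - m['measured_at'].date()).days)
        * asn_weight(K)
        * category_weight(m['domain_category'])
        * (1.0 + m['corroboration_score'])
        for m in measurements
    ], dtype=float)
    p = np.array([
        1.0 - (1.0 - m['prob_dns_tampering'])
            * (1.0 - m['prob_http_blocking'])
            * (1.0 - m['prob_tls_interference'])
        for m in measurements
    ], dtype=float)
    if w.sum() == 0.0:
        return 0.0, 0.0  # nothing inside the window

    rng = np.random.default_rng(seed=42)
    scores = []
    for _ in range(n):
        idx = rng.integers(0, len(w), size=len(w))  # sample with replacement
        denom = w[idx].sum()
        scores.append(float((w[idx] * p[idx]).sum() / denom) if denom > 0.0 else 0.0)

    alpha = (1.0 - ci) / 2.0
    lower = float(np.quantile(scores, alpha))
    upper = float(np.quantile(scores, 1.0 - alpha))
    return lower, upper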

Countries with fewer than 500 measurements in the rolling window receive a doubled CI width (the raw bootstrap CI multiplied by 2.0) and the low_coverage flag in the API response. The bootstrap alone understates uncertainty for sparse-data countries because the resamples are constrained to the small pool of available measurements: they cannot capture the uncertainty that comes from domains we haven't probed.

Example: computing Iran's score

Let's walk through a concrete computation for Iran over a recent 90-day window.

Measurement pool: 47,000 measurements in the 90-day window, all at confidence_tier ≥ corroborated.
Active ASNs: 12 unique ASNs observed in the window. ASN weight = 1/√12 ≈ 0.289, applied uniformly to all measurements on each ASN. Since this factor is uniform across all measurements, it cancels in the weighted average and does not shift Iran's score. Under the formula as written, the same cancellation happens in every country, so the factor does not by itself move countries relative to one another in the global rankings.
Domain category breakdown: 38% of measurements target news_media or political_content (weights 2.0 and 1.6 respectively), 22% target social_media or messaging (weight 1.8 each), 15% target vpn_circumvention (weight 1.5), 14% target religious or lgbtq content (weights 1.2 and 1.4 respectively), and the remaining 11% span other categories.
Corroboration rate: 73% of measurements have corroboration_score ≥ 0.5 (corroborated by at least one external source). Average corroboration_score across the pool: 0.61, giving an average multiplier of 1.61.
Interference rates by category (avg prob_any_interference): news_media: 0.91; social_media/messaging: 0.87; vpn_circumvention: 0.96; religious: 0.44; lgbtq: 0.78; other: 0.31.

After applying all weights and the recency decay over the 90-day window, the composite score comes to approximately 0.81. The 90% bootstrap confidence interval over 47,000 measurements is tight: ±0.03. The published API response for Iran:

{
  "country_code": "IR",
  "country_name": "Iran",
  "censorship_score": 0.8103,
  "censorship_score_lower": 0.7811,
  "censorship_score_upper": 0.8394,
  "censorship_score_30d_delta": 0.0214,
  "measurement_count_90d": 47000,
  "active_asn_count": 12,
  "corroboration_rate": 0.73,
  "low_coverage": false,
  "coverage_tier": "high",
  "window_start": "2025-04-08",
  "window_end": "2025-07-08"
}

The positive 30-day delta (+0.021) reflects a tightening of restrictions observed in the measurement signal over the preceding month — consistent with geopolitical events documented in the incident log. The delta is computed from smoothed scores, so short-term measurement fluctuations do not cause noise in the trend signal.
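
On the consumer side, a client might read the interval and the delta together rather than the point score alone. The sketch below parses a trimmed copy of the response above with the standard library; the summarize helper and its output format are hypothetical, not part of any Voidly client.

```python
import json

# Trimmed copy of the example response, as a JSON string.
payload = """
{
  "country_code": "IR",
  "censorship_score": 0.8103,
  "censorship_score_lower": 0.7811,
  "censorship_score_upper": 0.8394,
  "censorship_score_30d_delta": 0.0214,
  "low_coverage": false
}
"""

def summarize(raw: str) -> str:
    """One-line summary: score, 90% interval, trend, coverage flag."""
    r = json.loads(raw)
    trend = "rising" if r["censorship_score_30d_delta"] > 0 else "falling or flat"
    flag = " (low coverage)" if r["low_coverage"] else ""
    return (
        f'{r["country_code"]}: {r["censorship_score"]:.2f} '
        f'[{r["censorship_score_lower"]:.2f}, {r["censorship_score_upper"]:.2f}] '
        f"{trend}{flag}"
    )

print(summarize(payload))  # IR: 0.81 [0.78, 0.84] rising
```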

Where the score appears

The country censorship score surfaces in three places in the Voidly system:

The REST API. GET /v1/countries/{cc}/summary returns censorship_score, censorship_score_lower, censorship_score_upper, censorship_score_30d_delta, and the coverage metadata. Time-series data is available via GET /v1/countries/{cc}/score-history?window=90d, which returns the smoothed daily score for the trailing period.
The voidly.ai dashboard rankings. The global rankings table uses the smoothed score as of the most recent day in the window, not an average over the window. Countries are sorted descending. The delta indicator (red up-arrow, green down-arrow) is derived from censorship_score_30d_delta.
The shutdown forecasting model. censorship_score is one of the input features used by the 7-day internet shutdown prediction model. Historically, countries that experience sudden internet shutdowns tend to have elevated and rising censorship scores in the weeks preceding the shutdown — the score captures the background level of infrastructure-level control that precedes more severe disruption. The forecasting model consumes the raw (unsmoothed) daily score as a time-series feature, not the smoothed published score.

One important caveat: censorship_score is not a ranking of “how evil” a government is. It is a measurement of detected, cross-source-corroborated interference with access to specific domain categories, weighted by the political significance of those categories. Countries with low probe coverage will have wider uncertainty bands, and their point scores may not reflect their true censorship environment. Countries that implement censorship through legal pressure on platforms (content takedowns, shadow bans) rather than technical blocking will score lower than countries with equivalent practical restrictions on information access. The score measures what the probes measure, and the probes measure network-layer interference, not the full landscape of information control.