
Per-domain censorship history in Voidly: tracking blocking events across countries and time

9 min read · AI Analytics
Censorship · Voidly · Infrastructure · Methodology

The country-level censorship index aggregates millions of per-measurement signals into a single score per country. The per-ASN analysis disaggregates those signals by autonomous system to identify which carriers enforce which rules. But there is a third dimension that neither view captures on its own: the full lifecycle of a blocking event for a specific domain — when it started, which countries acted, how long it lasted, and whether those countries coordinated. This is the domain history dimension, and it is the primary lens for understanding whether a domain is structurally blocked or transiently targeted.

For any domain in the Voidly test list, the system maintains a complete record of every measurement across all probe countries — not just the current state, but the entire trajectory from first detection through to the present day. This post covers the data model, the API, and the analytical patterns that emerge when you treat blocking as a time-series phenomenon rather than a snapshot.

The domain history data model

Raw probe measurements land in TimescaleDB as individual rows — one per probe run, per domain, per country. At that granularity the table is too large for ad hoc queries: 100,000 measurements per day across 50,000 tracked domains and 200 countries would produce tables in the billions of rows within months. The primary read path for domain-centric queries is the domain_measurement_summary table, a TimescaleDB continuous aggregate that collapses the raw measurements into daily per-domain-per-country summaries:

CREATE TABLE domain_measurement_summary (
  domain           TEXT NOT NULL,
  country_code     TEXT NOT NULL,
  measurement_date DATE NOT NULL,
  total_probes     INTEGER NOT NULL,
  blocked_probes   INTEGER NOT NULL,
  blocking_rate    REAL NOT NULL,        -- blocked/total
  interference_types TEXT NOT NULL,      -- JSON array of types seen
  confidence       REAL NOT NULL,
  PRIMARY KEY (domain, country_code, measurement_date)
);
-- Populated by TimescaleDB continuous aggregate from raw measurements

The blocking_rate column is the fraction of probes that reported a confirmed blocking signal for that domain in that country on that calendar day. A value above 0.5 is treated as a “blocked day” for the purposes of streak and duration calculations. The threshold is deliberately conservative — brief DNS failures or transient network errors can drive single-probe measurements to a 1.0 rate without representing real blocking. Requiring a majority of probes to agree filters most transient noise while still capturing genuine partial-deployment blocks (where only some ISPs in a country comply).

The interference_types column is a JSON array of the interference classes observed in that day's measurements — e.g., ["dns_injection", "http_block_page"]. Multiple types can appear simultaneously when a country deploys layered enforcement: DNS injection as the primary mechanism and an HTTP block page as a fallback for clients using alternative resolvers. The daily aggregate retains the union of all types observed across all probes in that day.
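As a rough illustration of how one day's probes could collapse into a summary row, the sketch below computes the blocking rate, the blocked-day flag, and the union of interference types. The ProbeResult shape and the summarize_day helper are hypothetical names introduced here; the real continuous aggregate runs inside TimescaleDB, and the confidence column is left out because its derivation is not described above.

```python
import json
from dataclasses import dataclass
from typing import List, Optional

# From the text: a blocking_rate above 0.5 marks a "blocked day".
BLOCKED_DAY_THRESHOLD = 0.5


@dataclass(frozen=True)
class ProbeResult:
    """Hypothetical shape of one raw probe measurement."""
    blocked: bool
    interference_type: Optional[str] = None  # e.g. "dns_injection"


def summarize_day(probes: List[ProbeResult]) -> dict:
    """Collapse one day's probes for a domain/country pair into a summary row."""
    if not probes:
        raise ValueError("no probes for this day, so no summary row exists")
    blocked = [p for p in probes if p.blocked]
    rate = len(blocked) / len(probes)
    # Union of every interference type seen by any blocked probe that day.
    types = sorted({p.interference_type for p in blocked if p.interference_type})
    return {
        "total_probes": len(probes),
        "blocked_probes": len(blocked),
        "blocking_rate": rate,
        "interference_types": json.dumps(types),  # stored as a JSON array
        "is_blocked_day": rate > BLOCKED_DAY_THRESHOLD,
    }
```

Note that the comparison is strict: a day on which exactly half of the probes report blocking is not counted as a blocked day.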

First-seen and last-seen tracking

For each domain/country pair, Voidly maintains a derived statistics view that computes the full blocking lifecycle from the summary table. The four key fields are first_blocked_at, last_blocked_at, total_blocked_days, and longest_block_streak_days:

CREATE VIEW domain_blocking_stats AS
WITH flagged AS (
  SELECT
    domain,
    country_code,
    measurement_date,
    -- Confirmed blocked day: a majority of probes agree and the
    -- measurement has reached at least the corroborated tier
    (blocking_rate > 0.5 AND confidence >= 0.7) AS is_blocked,
    LAG(blocking_rate > 0.5 AND confidence >= 0.7) OVER w AS prev_blocked,
    measurement_date - LAG(measurement_date) OVER w AS days_since_prev
  FROM domain_measurement_summary
  WINDOW w AS (PARTITION BY domain, country_code ORDER BY measurement_date)
),
islands AS (
  SELECT
    *,
    -- Gap-and-island: a new streak starts at a blocked day whose previous
    -- row was not blocked, or lies more than one missing day in the past
    COUNT(*) FILTER (
      WHERE is_blocked
        AND (prev_blocked IS NOT TRUE OR days_since_prev > 2)
    ) OVER (PARTITION BY domain, country_code ORDER BY measurement_date)
      AS streak_group
  FROM flagged
),
streaks AS (
  SELECT
    *,
    -- Calendar length of the streak this row belongs to; a tolerated
    -- one-day measurement gap counts toward the length
    MAX(measurement_date) FILTER (WHERE is_blocked) OVER g
      - MIN(measurement_date) FILTER (WHERE is_blocked) OVER g + 1
      AS streak_length
  FROM islands
  WINDOW g AS (PARTITION BY domain, country_code, streak_group)
)
SELECT
  domain,
  country_code,

  -- Earliest confirmed blocking measurement (confidence >= 0.7)
  MIN(measurement_date) FILTER (WHERE is_blocked) AS first_blocked_at,

  -- Most recent confirmed blocking
  MAX(measurement_date) FILTER (WHERE is_blocked) AS last_blocked_at,

  -- Calendar days where blocking_rate > 0.5
  COUNT(*) FILTER (WHERE is_blocked) AS total_blocked_days,

  -- Longest continuous blocking period (gaps of <= 1 day tolerated)
  MAX(streak_length) AS longest_block_streak_days

FROM streaks
GROUP BY domain, country_code;

The confidence >= 0.7 threshold filters out measurements in the raw observed confidence tier. Only measurements that have reached at least the corroborated tier — confirmed by at least one external source or by probe consensus — are counted as confirmed blocking events. This prevents transient classifier outputs from polluting the first-seen timestamps, which matter for research workflows that treat first_blocked_at as the authoritative date a blocking event began.

The longest_block_streak_days field uses a gap-and-island approach with a one-day gap tolerance: if measurements are available for 29 of 30 consecutive days and all 29 show blocking, that is treated as a single 30-day streak rather than two shorter streaks separated by a measurement gap. This tolerance is necessary because probe coverage is not perfectly uniform: nodes occasionally miss a daily measurement window due to connectivity issues, and treating those gaps as the end of a streak would undercount long-duration blocking events.
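The gap tolerance is easier to see outside SQL. The following is a minimal Python sketch of the same idea, assuming the input is a mapping from measured calendar day to whether that day counted as a confirmed blocked day; the function name and the max_missing_days parameter are illustrative, not part of the Voidly schema.

```python
from datetime import date, timedelta
from typing import Dict, Optional


def longest_block_streak_days(
    daily_blocked: Dict[date, bool],
    max_missing_days: int = 1,
) -> int:
    """Longest run of blocked days, tolerating short measurement gaps.

    Days absent from the mapping were not measured. A measured day that is
    not blocked always ends the current streak.
    """
    best = 0
    streak_start: Optional[date] = None
    prev_day: Optional[date] = None
    for day in sorted(daily_blocked):
        if not daily_blocked[day]:
            streak_start = None  # a measured, unblocked day ends the streak
        else:
            gap_too_long = (
                prev_day is not None
                and (day - prev_day).days > max_missing_days + 1
            )
            if streak_start is None or gap_too_long:
                streak_start = day
            # Calendar length, so a tolerated gap day counts toward the streak.
            best = max(best, (day - streak_start).days + 1)
        prev_day = day
    return best
```

With 29 blocked measurements spread over 30 consecutive days this returns 30, whereas replacing the missing day with a measured, unblocked day splits the run into two shorter streaks.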

The domain history API endpoint

The GET /v1/domains/{domain}/history endpoint is the primary external interface to domain-level blocking data. It returns a summary object covering all countries in which the domain has been probed, along with per-country blocking statistics:

{
  "domain": "example.com",
  "global_blocking_rate": 0.17,
  "countries_with_blocking": 8,
  "measurement_countries": 47,
  "history": [
    {
      "country_code": "CN",
      "blocking_rate_30d": 0.97,
      "interference_type": "dns_injection",
      "first_blocked_at": "2019-06-04",
      "last_blocked_at": "2025-07-11",
      "is_ongoing": true
    }
  ]
}

The global_blocking_rate is the fraction of measurement countries that reported blocking_rate > 0.5 in the trailing 30 days — in the example above, 8 of 47 countries where the domain is probed are actively blocking it. This is a presence/absence metric rather than an intensity metric: a country that blocks a domain on 29 of 30 days counts the same as one that blocks it on 16 of 30. The per-country blocking_rate_30d field provides the intensity signal for countries where it matters.
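A small sketch of the presence/absence calculation, assuming each country's trailing 30-day rate is compared against the same 0.5 threshold used for blocked days. The helper name and the input mapping are hypothetical; the production aggregation may define "reported blocking" differently.

```python
from typing import Dict


def global_blocking_rate(
    rate_30d_by_country: Dict[str, float],
    threshold: float = 0.5,
) -> float:
    """Fraction of measurement countries whose 30-day rate exceeds the threshold."""
    if not rate_30d_by_country:
        return 0.0
    blocking = sum(1 for rate in rate_30d_by_country.values() if rate > threshold)
    return blocking / len(rate_30d_by_country)
```

Under this reading, 8 blocking countries out of 47 measured gives roughly 0.17, and a country at 29/30 contributes exactly as much as one at 16/30.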

The is_ongoing flag is set to true when last_blocked_at is within the past 14 days. This threshold accounts for the measurement cadence: domains in sensitive categories are probed daily, so a 14-day gap would represent a genuine cessation; domains in lower-priority categories may have a 7-day probing interval, making a 14-day gap ambiguous. The flag is a convenience field — callers that need precise freshness semantics should compare last_blocked_at against the last_measurement_at field (also returned by the endpoint) to determine whether the absence of recent blocking reflects genuine unblocking or a measurement gap.
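For callers that want the stricter freshness check described above, one possible client-side classification is sketched here. The status labels and the function name are assumptions made for illustration; only the 14-day window and the two date fields come from the endpoint description.

```python
from datetime import date
from typing import Optional

ONGOING_WINDOW_DAYS = 14  # from the text: is_ongoing uses a 14-day window


def blocking_status(
    last_blocked_at: Optional[date],
    last_measurement_at: Optional[date],
    today: date,
) -> str:
    """Classify a domain/country pair using hypothetical status labels."""
    if last_blocked_at is None:
        return "never_blocked"
    if (today - last_blocked_at).days <= ONGOING_WINDOW_DAYS:
        return "ongoing"
    # No blocking seen recently: is that unblocking, or a lack of data?
    if last_measurement_at is None or last_measurement_at <= last_blocked_at:
        return "measurement_gap"  # nothing measured since the last blocked day
    return "likely_unblocked"  # measured since then without confirmed blocking
```

The third branch is the case the convenience flag cannot express: is_ongoing would be false, yet no measurement exists that shows the domain reachable.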

Temporal patterns in domain blocking

Analysis of 50,000 unique domains that experienced at least one blocking event in 2025 reveals structured temporal patterns that differ sharply by interference type and political context.

  • Median blocking duration varies by mechanism. DNS injection events have a median duration of 14 days. HTTP block page events — which typically indicate a court order or formal regulatory action — have a median duration of 180+ days. The difference reflects the enforcement architecture: DNS injection can be deployed and retracted at the ISP level in minutes, while HTTP block pages backed by legal orders require formal proceedings to remove.
  • 23% of newly blocked domains are unblocked within 7 days. These transient blocks cluster around politically sensitive dates — election days, protest events, and national security incidents. A news site blocked for 48 hours during a vote count, then unblocked, produces a first-seen and last-seen separated by two days. The total_blocked_days field distinguishes these from long-duration blocks: a domain with total_blocked_days = 2 and longest_block_streak_days = 2 is a candidate transient block.
  • 41% of blocked domains show geographically synchronized blocking. Defined as blocking appearing in 3 or more countries within a 24-hour window, this pattern is characteristic of coordinated takedowns — typically involving MLAT requests, shared law enforcement action, or compliance with a multi-jurisdictional court order. The remaining 59% show sequential spread over days to weeks, consistent with independent national-level enforcement decisions.
  • Wikipedia and Tor are the most consistently blocked domains globally. Both appear in 40+ countries over 5+ years of measurement data, spanning every interference type in the classifier's taxonomy. Tor is blocked via DNS injection in the majority of cases; Wikipedia blocks more frequently use HTTP interception (with country-specific article filtering a common pattern in countries that partially block rather than fully block).
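Two of the patterns above reduce to simple checks. The sketch below shows one way to test for geographic synchronization (3 or more countries within 24 hours) and to flag transient-block candidates. It assumes onset timestamps with sub-day resolution are available, which is finer than the daily first_blocked_at field, and the 7-day cutoff in the second helper is borrowed from the "unblocked within 7 days" statistic rather than from any documented rule.

```python
from datetime import datetime, timedelta
from typing import Dict


def is_synchronized(
    onset_by_country: Dict[str, datetime],
    min_countries: int = 3,
    window: timedelta = timedelta(hours=24),
) -> bool:
    """True if at least min_countries onsets fall inside one sliding window."""
    times = sorted(onset_by_country.values())
    left = 0
    for right in range(len(times)):
        # Shrink the window from the left until it spans at most `window`.
        while times[right] - times[left] > window:
            left += 1
        if right - left + 1 >= min_countries:
            return True
    return False


def is_transient_candidate(
    total_blocked_days: int,
    longest_block_streak_days: int,
    max_days: int = 7,
) -> bool:
    """Short-lived block: few blocked days and no long streak (assumed cutoff)."""
    return 0 < total_blocked_days <= max_days and longest_block_streak_days <= max_days
```

Keying the onsets by country ensures each country is counted once, so repeated onsets in a single country cannot satisfy the three-country requirement.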

Cross-country blocking correlation

When a domain becomes newly blocked in one country, the probability that it will appear blocked in another country within the next 30 days is a function of the domain's category. For news media domains, that conditional probability is 0.31 — nearly one in three newly blocked news sites will see blocking spread to at least one other country within 30 days. For general web domains, the probability drops to 0.04.

The category correlation matrix captures these relationships across all domain categories and country pairs. Each cell in the matrix contains the empirically derived 30-day spread probability for domains of that category. The matrix is sparse — most category/country pairs have near-zero correlation — but the high-value cells are structurally stable across years of data:

# Category cross-country blocking spread (30-day conditional probability)
# P(domain blocked in country B | newly blocked in country A, domain.category = C)
#
# Computed quarterly from the domain_measurement_summary table.
# Only country pairs with >= 50 joint observations included.
#
# Sample rows (category = NEWS_MEDIA):
#   source_country  target_country  spread_prob_30d
#   RU              BY              0.58
#   CN              VN              0.41
#   IR              TJ              0.36
#   TR              AZ              0.29
#   EG              SA              0.22
#
# These are not causal estimates — they reflect shared political context
# (similar state censorship doctrine, coordinated enforcement regimes)
# more than direct information flow between censors.

The shutdown forecasting model uses this matrix as a feature: when a domain in a high-spread category is newly blocked in a country with known high-correlation neighbors, the model increases its estimated probability of broader blocking events in those neighbors. The mechanism is not assumed to be causal — Russia blocking a news site does not cause Belarus to block it — but the correlation is predictively useful because both countries respond to the same underlying political pressures on similar timescales.

The matrix is recalculated quarterly from the summary table. Country pairs with fewer than 50 joint observations are excluded from the matrix to prevent spurious high-probability entries in sparse-coverage pairs. A separate low-data Bayesian prior (category-level average spread probability) fills in for excluded pairs during forecasting.
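The estimate for a single matrix cell, including the sparse-pair fallback, can be sketched as follows. This assumes a "joint observation" is a domain newly blocked in the source country while the target country was also being measured, and that spread means the target's onset falls 0 to 30 days after the source's. Both definitions, and the function name, are illustrative rather than taken from the pipeline.

```python
from datetime import date, timedelta
from typing import List, Optional, Tuple

MIN_JOINT_OBSERVATIONS = 50  # from the text: sparser pairs are excluded


def spread_prob_30d(
    events: List[Tuple[date, Optional[date]]],
    category_prior: float,
    window_days: int = 30,
) -> float:
    """Estimate P(blocked in B within 30 days | newly blocked in A).

    Each event is (onset in source country, onset in target country or None).
    """
    if len(events) < MIN_JOINT_OBSERVATIONS:
        # Too few joint observations: fall back to the category-level average.
        return category_prior
    spread = sum(
        1
        for source_onset, target_onset in events
        if target_onset is not None
        and 0 <= (target_onset - source_onset).days <= window_days
    )
    return spread / len(events)
```

Returning the prior for sparse pairs keeps a handful of coincidental co-blocks from producing a cell near 1.0.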

The domain blocking timeline visualization

The REST API supports a timeline response format for charting blocking history over time. The format=timeline parameter causes the endpoint to return measurement-level aggregates bucketed into 7-day windows rather than the per-country summary structure:

GET /v1/domains/{domain}/history?format=timeline&country=RU

# Response:
{
  "domain": "example.com",
  "country_code": "RU",
  "window_days": 7,
  "series": [
    {
      "week_start": "2025-06-01",
      "blocking_rate": 0.94,
      "probe_count": 312,
      "interference_types": ["dns_injection"],
      "confidence": 0.91
    },
    {
      "week_start": "2025-06-08",
      "blocking_rate": 0.96,
      "probe_count": 298,
      "interference_types": ["dns_injection", "http_block_page"],
      "confidence": 0.93
    }
    // ... one entry per 7-day window
  ]
}

The 7-day window is fixed for the timeline format because it balances temporal resolution against noise. Daily granularity would show coverage-driven fluctuations (fewer probes on weekends in some markets) as apparent blocking rate changes; monthly granularity would obscure the onset dates of new blocking events. The 7-day window is the resolution at which onset events are reliably visible as a step increase in blocking_rate without being obscured by probe count variability.
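To make the bucketing concrete, here is a minimal sketch that rolls daily summary rows up into 7-day windows. It assumes the weekly blocking_rate is probe-weighted (total blocked probes over total probes) rather than a mean of daily rates, and that windows are anchored on a caller-supplied start date. The row keys mirror the summary table, but the function is hypothetical, and the per-window confidence value is omitted because its aggregation rule is not described.

```python
from collections import defaultdict
from datetime import date, timedelta
from typing import Dict, List


def weekly_series(daily_rows: List[dict], series_start: date) -> List[dict]:
    """Bucket daily rows into 7-day windows anchored at series_start."""
    buckets: Dict[date, dict] = defaultdict(
        lambda: {"total": 0, "blocked": 0, "types": set()}
    )
    for row in daily_rows:
        offset = (row["measurement_date"] - series_start).days
        if offset < 0:
            continue  # before the start of the requested series
        week_start = series_start + timedelta(days=7 * (offset // 7))
        bucket = buckets[week_start]
        bucket["total"] += row["total_probes"]
        bucket["blocked"] += row["blocked_probes"]
        bucket["types"].update(row["interference_types"])

    series = []
    for week_start in sorted(buckets):
        b = buckets[week_start]
        series.append({
            "week_start": week_start.isoformat(),
            # Probe-weighted, so a low-coverage day cannot swing the week.
            "blocking_rate": b["blocked"] / b["total"] if b["total"] else 0.0,
            "probe_count": b["total"],
            "interference_types": sorted(b["types"]),
        })
    return series
```

Weighting by probe count is what damps the weekend effect: a day with few probes contributes proportionally little to the window's rate.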

The Voidly web interface uses this endpoint to render the blocking history chart for each domain/country combination. The interference_types array drives the chart's color encoding — a week showing only DNS injection is colored differently from a week showing a transition to layered enforcement. Transitions between interference types are particularly informative: a shift from DNS injection to HTTP block page often indicates escalation from ISP-level enforcement to a formal court order.

Domain category classification

Each tracked domain is assigned a category from the OONI category code taxonomy: NEWS_MEDIA, HUMAN_RIGHTS, LGBTQI+, CIRCUMVENTION, GOVERNMENT, SOCIAL_NETWORKS, and approximately 30 others. Category drives three system behaviors:

  • Priority scheduling. High-risk categories (NEWS_MEDIA, CIRCUMVENTION, HUMAN_RIGHTS) are assigned a target measurement interval of 1 day. Lower-risk categories receive intervals of 2 to 7 days. The measurement scheduler uses category as the primary scheduling signal, with recent blocking history as a secondary signal that can override the category default.
  • Trend analysis. Category enables aggregate queries like “which domain categories saw the most newly blocked domains in Q2 2025?” In Q2 2025, CIRCUMVENTION and NEWS_MEDIA led new blocking events, driven by increased enforcement in Southeast Asia and Eastern Europe respectively. Category-level trend analysis surfaces patterns that are invisible in raw domain counts — a surge in VPN blocking often precedes broader information control campaigns.
  • False positive rates. Streaming services and content delivery domains have structurally higher geoblocking rates — legal licensing restrictions cause them to return error responses from foreign IP addresses regardless of any censorship intent. The classifier's prior for p_geoblock is higher for ENTERTAINMENT and CDN categories, which reduces the probability that a failed request is classified as censorship rather than geoblocking. Correct category assignment is therefore load-bearing for classification accuracy, not merely metadata.

Bulk domain query support

The API accepts batch queries for use cases that require checking many domains simultaneously. POST /v1/domains/batch accepts a JSON body with up to 500 domain names and returns a summary object per domain: the same fields as the single-domain endpoint, but without the full per-country history array (which would be impractical at batch scale).

POST /v1/domains/batch
Content-Type: application/json

{
  "domains": [
    "example.com",
    "anotherdomain.org",
    // ... up to 500
  ],
  "countries": ["RU", "CN", "IR"],   // optional: filter to specific countries
  "fields": ["global_blocking_rate", "countries_with_blocking", "is_ongoing"]
}

# Response: one summary object per domain
# Runtime: < 200ms for 500 domains via materialized view lookups

The batch endpoint is designed for two common operational workflows. The first is compliance checking: an organization running a CDN or content delivery service can check whether any of their origin domains are blocked in countries where they operate — a domain blocked in a sanctioned country may require legal review before continued service. The second is research batch processing: a researcher studying press freedom can submit a list of 200 news sites and immediately see which are blocked in Russia, without issuing 200 sequential API calls.

The countries filter parameter constrains the per-country data returned in the response. Without it, the response includes a summary across all measurement countries; with it, the response includes per-country breakdown only for the specified countries. The filter does not affect which countries are included in the global_blocking_rate computation — that always reflects the full probe network. The filter only affects what is returned in the response body.
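Lists longer than the 500-domain limit have to be split client-side. The sketch below builds one JSON request body per chunk using the field names from the example request; the helper name is hypothetical, and sending the requests (base URL, authentication, retries) is deliberately left out because none of that is specified here.

```python
import json
from typing import Iterator, Optional, Sequence

MAX_BATCH_SIZE = 500  # from the text: up to 500 domain names per request


def build_batch_payloads(
    domains: Sequence[str],
    countries: Optional[Sequence[str]] = None,
    fields: Optional[Sequence[str]] = None,
) -> Iterator[str]:
    """Yield JSON bodies for POST /v1/domains/batch, at most 500 domains each."""
    unique = list(dict.fromkeys(domains))  # drop duplicates, keep order
    for start in range(0, len(unique), MAX_BATCH_SIZE):
        body = {"domains": unique[start:start + MAX_BATCH_SIZE]}
        if countries:
            body["countries"] = list(countries)  # optional response filter
        if fields:
            body["fields"] = list(fields)
        yield json.dumps(body)
```

For the research workflow above, a list of 200 news sites with countries=["RU"] produces a single payload.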

Historical data coverage

Voidly's own measurement database covers domains back to August 2024, when the probe network launched. For domains with longer histories, measurements from OONI (2012 to present) and CensoredPlanet (2018 to present) are integrated into the domain history via the cross-source corroboration pipeline. These external measurements are ingested as first-class rows in domain_measurement_summary with their source identified in the source column.

This means that first_blocked_at for a long-blocked domain like Wikipedia in China may be anchored to a 2012 OONI measurement rather than a Voidly probe run. The confidence value for those historical rows reflects the source quality — OONI measurements from 2012 that have been stable in the corpus for years receive a high confidence value; CensoredPlanet measurements that have not been cross-validated with a second source receive a lower value. The corroboration pipeline does not retroactively upgrade confidence based on later evidence — each measurement's confidence is fixed at ingest time based on what was available then.

Coverage density increases significantly post-August 2024 when Voidly's own probes began contributing. For most tracked domains in coverage-rich countries, the post-launch data density is 10 to 50 times higher than the pre-launch OONI/CensoredPlanet baseline. This creates a heteroskedastic time series — the uncertainty on a domain's blocking rate estimate is much lower for recent measurements than for historical ones — which the API communicates via the per-entry confidence field in the timeline response.

Domain freshness and probing frequency

Not all domains are probed at the same frequency. The measurement scheduler assigns each domain to a probe run cadence based on its OONI category and its recent blocking history. A domain that has been clean for 90 days may be probed weekly; a domain with active blocking in any country is probed daily.

The freshness score quantifies how current the available measurements are relative to the domain's target interval:

# freshness_score = days_since_last_measurement / target_interval_days
#
# target_interval_days by category:
#   NEWS_MEDIA, CIRCUMVENTION, HUMAN_RIGHTS   → 1 day
#   SOCIAL_NETWORKS, MESSAGING, LGBTQI+       → 2 days
#   POLITICAL_CONTENT, RELIGIOUS              → 3 days
#   GOVERNMENT, CULTURE                       → 5 days
#   GENERAL (default)                         → 7 days
#
# freshness_score = 1.0  → measurement is exactly at the target interval
# freshness_score < 1.0  → measurement is more recent than required (good)
# freshness_score > 1.0  → measurement is overdue (stale)
# freshness_score > 3.0  → domain flagged for priority rescheduling

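# Target measurement interval in days, keyed by normalized category code.
TARGET_INTERVAL_DAYS = {
    'NEWS_MEDIA': 1,
    'CIRCUMVENTION': 1,
    'HUMAN_RIGHTS': 1,
    'SOCIAL_NETWORKS': 2,
    'MESSAGING': 2,
    'LGBTQI': 2,
    'POLITICAL_CONTENT': 3,
    'RELIGIOUS': 3,
    'GOVERNMENT': 5,
    'CULTURE': 5,
}
DEFAULT_TARGET_INTERVAL_DAYS = 7  # GENERAL and any unlisted category


def freshness_score(
    days_since_last_measurement: float,
    domain_category: str,
) -> float:
    """Return measurement age divided by the category's target interval."""
    # Normalize so 'news_media', 'NEWS_MEDIA', and 'LGBTQI+' all resolve.
    key = domain_category.strip().upper().rstrip('+')
    target = TARGET_INTERVAL_DAYS.get(key, DEFAULT_TARGET_INTERVAL_DAYS)
    return days_since_last_measurement / target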

The freshness score is computed continuously and exposed as a field in the domain history API response. API consumers that need to reason about data currency should check this field before treating a domain's blocking status as current. A news site with freshness_score = 4.2 has not been measured in over four days despite a 1-day target interval, so the absence of recent blocking detections may reflect a measurement gap rather than genuine unblocking.

In addition to the category-based target interval, the scheduler applies a dynamic override: any domain that appears in the cross-country correlation matrix with a recently triggered neighbor (a correlated domain newly blocked in a high-spread-probability country) gets its target interval halved for the subsequent 7-day window. This ensures that the measurement network catches spread events quickly without permanently increasing the probing cost for every domain in a category.
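The dynamic override can be sketched as a thin wrapper around the category default. The function below assumes the scheduler records the date on which a correlated neighbor was triggered; that input, the function name, and the treatment of the 7-day window as days 0 through 6 after the trigger are all illustrative assumptions.

```python
from datetime import date
from typing import Optional

OVERRIDE_WINDOW_DAYS = 7  # from the text: halving lasts for the following 7 days


def effective_target_interval(
    base_interval_days: float,
    neighbor_triggered_on: Optional[date],
    today: date,
) -> float:
    """Halve the target interval while a correlated-neighbor trigger is active."""
    if neighbor_triggered_on is not None:
        age_days = (today - neighbor_triggered_on).days
        if 0 <= age_days < OVERRIDE_WINDOW_DAYS:
            return base_interval_days / 2
    return base_interval_days
```

Because the override expires on its own, a category with a 3-day default is probed every 1.5 days only while a spread event is plausible, then reverts without any scheduler cleanup.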


For how per-domain blocking rates aggregate into the per-country censorship index: Voidly's country-level censorship score: aggregating 2.2B probe measurements into the global index →

For how per-ASN analysis uses the same domain measurements to pinpoint ISP-level enforcement: Voidly's per-ASN blocking analysis: distinguishing ISP-level enforcement from nationwide censorship orders →

For the test list of domains that Voidly probes — how domains are selected for measurement: Voidly's URL test list: how we curate the domains that reveal internet censorship →

For the measurement scheduler that determines how frequently each domain is probed: The Voidly measurement scheduler: how we decide which domains to probe and when →