Voidly's block page fingerprint library: detecting censorship signatures across 2,300+ known pages
When an ISP or government blocks a website, one of several things happens at the network layer: the connection is silently dropped (null routing), the TCP session is reset with a forged RST packet, the DNS resolver returns NXDOMAIN, or — the case this article covers — the censor returns a custom HTML page. That HTML page typically says something like “This site has been blocked” or “Access to this content is prohibited under law,” sometimes with government or ISP branding. These are block pages.
Block pages exist because censorship often has a legal or administrative framing. Many jurisdictions require ISPs to notify users when content is blocked, rather than simply making it disappear. Turkey's BTK (Information Technologies and Communication Authority) regulations mandate that blocked sites display a standard notice. Iran's FATA (Cyber Police) produces a distinctive page that its ISPs are required to serve. Germany's FSM age-verification system produces block pages for adult content that ISPs have agreed to filter.
For censorship measurement, block pages are simultaneously useful and tricky. They're useful because they're a definitive signal — if you can identify that a response body is a known block page, you've confirmed interference without needing to reason about DNS consistency or TLS anomalies. They're tricky because ISPs frequently customize the default government template, and the same conceptual block page appears in dozens of variants across carriers in the same country.
Voidly's block page fingerprint library is the component that resolves this. It holds 2,300+ known block-page signatures across 80 countries, supports four matching strategies with different tradeoffs between precision and coverage, and is the data source consulted by the control server comparison stage before any HTTP measurement is labeled as a block.
Why fingerprinting is necessary
A Voidly probe cannot determine from the response alone whether it's looking at a block page or a legitimate server error. A 403 Forbidden can mean “you are blocked” or “you need to authenticate.” A short HTML body can be a catch-all error page from a CDN. An unexpected 200 OK can carry a block page if the ISP returns it with that status code to avoid detection.
The probe's response is always compared against a control measurement — a simultaneous fetch from a Cloudflare-hosted vantage outside the country being measured. When the probe's body diverges from the control's body, the fingerprint library asks: “Is this divergent body something we recognize as a block page?” A positive match elevates the classification confidence immediately. A negative match — body is different from control, but not in the library — falls back to structural heuristics like body length ratio and HTTP status comparison.
Without the fingerprint library, HTTP blocking detection would rely entirely on those structural heuristics, which produce more false positives (short error pages from legitimate servers) and more false negatives (block pages served with 200 OK and similar body length to the original site). The library gives the classifier a near-certain signal when it fires.
Library structure and storage
Each entry in the library is a BlockPageEntry:
from dataclasses import dataclass
from typing import Optional

@dataclass
class BlockPageEntry:
    fp_id: str                    # "BP-TR-001": country code + sequential index
    country_code: str             # ISO 3166-1 alpha-2, e.g. "TR" for Turkey
    asn: Optional[int]            # carrier ASN, e.g. 9121 (Türk Telekom);
                                  # None if the entry matches across all carriers in country
    method: str                   # "exact_hash" | "simhash" | "structural" | "tls_cert"
    hash_value: str               # SHA-256 hex or SimHash bit string
    similarity_threshold: float   # 1.0 for exact, 0.85–0.92 for simhash
    added_date: str               # ISO 8601, e.g. "2023-04-15"
    incident_count: int           # confirmed incidents that matched this fingerprint
    source: str                   # "ooni_confirmed" | "censored_planet" | "probe_capture"
                                  # | "analyst_submission"
    notes: str                    # free-text, e.g. "BTK order template, Türk Telekom variant"

The fp_id format is BP-<CC>-<NNN>, where CC is the ISO country code and NNN is a zero-padded sequential number within that country. Turkey's entries run from BP-TR-001 through BP-TR-047 at current count. Iran runs from BP-IR-001 through BP-IR-312, reflecting a much larger corpus of ISP-specific variants.
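As a small illustration of the identifier convention, the helpers below build and parse fp_id strings. The function names are hypothetical, and the behaviour past index 999 (simply widening the number) is an assumption, since the text only shows three-digit examples.

```python
import re
from typing import Tuple

# Pattern inferred from the examples in the text (BP-TR-001, BP-IR-312).
_FP_ID_RE = re.compile(r"BP-([A-Z]{2})-(\d{3,})")


def format_fp_id(country_code: str, index: int) -> str:
    """Build an fp_id such as 'BP-TR-047' from a country code and index."""
    if not re.fullmatch(r"[A-Z]{2}", country_code):
        raise ValueError(f"expected ISO 3166-1 alpha-2 code, got {country_code!r}")
    if index < 1:
        raise ValueError("index must be positive")
    return f"BP-{country_code}-{index:03d}"


def parse_fp_id(fp_id: str) -> Tuple[str, int]:
    """Split an fp_id back into (country_code, index)."""
    m = _FP_ID_RE.fullmatch(fp_id)
    if m is None:
        raise ValueError(f"malformed fp_id: {fp_id!r}")
    return m.group(1), int(m.group(2))
```

Round-tripping through both helpers is a cheap sanity check when importing entries in bulk.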
The library is stored as a SQLite table on Cloudflare D1, replicated to all three control server nodes (US-East, EU-West, AP-East). Replication is push-based: when a new entry is added or modified in the primary D1 instance, the change is written to the other two within 30 seconds via a Cloudflare Worker that runs on the primary node. The D1 table is append-only for fingerprint data: entries are never deleted, only soft-deleted with a retired_date field when they're removed from active matching.
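The soft-delete convention can be sketched with Python's built-in sqlite3 module. This is a local in-memory stand-in rather than D1, the table is cut down to the three columns the example needs, and the hash values and retirement date are placeholders.

```python
import sqlite3
from datetime import date
from typing import List

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE block_page_fingerprints (
        fp_id        TEXT PRIMARY KEY,
        hash_value   TEXT NOT NULL,
        retired_date TEXT            -- NULL = active
    )
    """
)
conn.executemany(
    "INSERT INTO block_page_fingerprints (fp_id, hash_value) VALUES (?, ?)",
    [("BP-TR-001", "placeholder-hash-1"), ("BP-TR-002", "placeholder-hash-2")],
)


def retire(conn: sqlite3.Connection, fp_id: str, on: date) -> None:
    """Soft-delete: stamp retired_date instead of deleting the row."""
    conn.execute(
        "UPDATE block_page_fingerprints SET retired_date = ? "
        "WHERE fp_id = ? AND retired_date IS NULL",
        (on.isoformat(), fp_id),
    )


def active_ids(conn: sqlite3.Connection) -> List[str]:
    """Return fp_ids that are still eligible for matching."""
    rows = conn.execute(
        "SELECT fp_id FROM block_page_fingerprints "
        "WHERE retired_date IS NULL ORDER BY fp_id"
    )
    return [row[0] for row in rows]


retire(conn, "BP-TR-002", date(2024, 1, 8))
```

After the call, the retired row is still present in the table but no longer appears in active_ids(conn), which is the property the matching stages rely on.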
The SQLite schema:
CREATE TABLE block_page_fingerprints (
    fp_id                TEXT PRIMARY KEY,
    country_code         TEXT NOT NULL,
    asn                  INTEGER,
    method               TEXT NOT NULL CHECK(method IN (
                             'exact_hash', 'simhash', 'structural', 'tls_cert'
                         )),
    hash_value           TEXT NOT NULL,
    similarity_threshold REAL NOT NULL DEFAULT 1.0,
    added_date           TEXT NOT NULL,
    retired_date         TEXT,              -- NULL = active
    incident_count       INTEGER NOT NULL DEFAULT 0,
    source               TEXT NOT NULL,
    notes                TEXT NOT NULL DEFAULT ''
);

CREATE INDEX idx_bpf_country ON block_page_fingerprints(country_code)
    WHERE retired_date IS NULL;

CREATE INDEX idx_bpf_asn ON block_page_fingerprints(asn)
    WHERE retired_date IS NULL AND asn IS NOT NULL;

CREATE INDEX idx_bpf_exact_hash ON block_page_fingerprints(hash_value)
    WHERE method = 'exact_hash' AND retired_date IS NULL;

Four matching strategies
Block pages don't stay constant. ISPs rebrand them, governments update the legal notice template, and carriers tweak the styling. The library uses four matching strategies at different points in the pipeline to handle the range of variation.
1. Exact hash match
The fastest and most precise strategy. SHA-256 of the raw response body is computed and looked up directly against the index on hash_value where method = 'exact_hash'. This works for block pages that are entirely static: no dynamic content, no timestamp, no reflected URL. Turkey's BTK default template historically fell in this category, as do most government-mandated pages in Russia that predate 2022.
Exact hash matches carry a match score of 1.0 and are treated as definitive. Roughly 60% of all library entries are exact hash entries, and they account for the majority of matches in production because the static pages are served at high volume.
2. Structural match
Many block pages embed dynamic fields: the blocked URL, the user's source IP address, a request timestamp, or a unique complaint reference number. These fields make the raw body different for every request, defeating the exact hash. Structural matching normalizes them out before hashing.
The normalization pass:
import re
from typing import Optional

# Patterns for fields that vary per-request in known block page templates
_DYNAMIC_PATTERNS = [
    # Reflected URL in block page
    (re.compile(r'''(?:url|URL|href|src)=["']https?://[^\s"'<>]+["']'''), 'URL_REDACTED'),
    # IPv4 addresses (visitor IP, server IP)
    (re.compile(r'(?:\d{1,3}\.){3}\d{1,3}'), 'IP_REDACTED'),
    # ISO 8601 timestamps
    (re.compile(r'\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}'), 'TS_REDACTED'),
    # Unix epoch timestamps (10 or 13 digits)
    (re.compile(r'\d{10,13}'), 'EPOCH_REDACTED'),
    # UUIDs (complaint/reference numbers)
    (re.compile(
        r'[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}',
        re.IGNORECASE
    ), 'UUID_REDACTED'),
    # Query strings in any link
    (re.compile(r'\?[A-Za-z0-9&=%_\-+.]+'), '?QUERY_REDACTED'),
]

def normalize_html(body: bytes) -> bytes:
    """Strip dynamic per-request fields from a block page body."""
    text = body.decode('utf-8', errors='replace')
    for pattern, replacement in _DYNAMIC_PATTERNS:
        text = pattern.sub(replacement, text)
    # Collapse whitespace variations that don't affect content
    text = re.sub(r'\s+', ' ', text).strip()
    return text.encode('utf-8')

The normalized body is then SHA-256 hashed and looked up against entries with method = 'structural'. Structural matches carry a score of 0.97: high confidence, but slightly below exact because in rare cases two different pages can normalize to the same string.
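To make the effect of normalization concrete, the sketch below hashes two renderings of the same template before and after normalization. It is self-contained and uses a reduced pattern set (IPv4 addresses and ISO 8601 timestamps only); the two pages are invented for illustration and the addresses come from documentation ranges.

```python
import re
from hashlib import sha256

# Reduced pattern set for illustration only (IPv4 addresses and ISO 8601 timestamps).
_PATTERNS = [
    (re.compile(r"(?:\d{1,3}\.){3}\d{1,3}"), "IP_REDACTED"),
    (re.compile(r"\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}"), "TS_REDACTED"),
]


def structural_hash(body: bytes) -> str:
    """Normalize per-request fields and whitespace, then SHA-256 the result."""
    text = body.decode("utf-8", errors="replace")
    for pattern, replacement in _PATTERNS:
        text = pattern.sub(replacement, text)
    text = re.sub(r"\s+", " ", text).strip()
    return sha256(text.encode("utf-8")).hexdigest()


# Two hypothetical renderings of one template, differing only in per-request fields
# and in how whitespace between elements is laid out.
page_a = (
    b"<html><body><p>Access blocked.</p> "
    b"<p>Your IP: 203.0.113.7</p> <p>2024-03-01 09:15:00</p></body></html>"
)
page_b = (
    b"<html><body><p>Access blocked.</p>\n   "
    b"<p>Your IP: 198.51.100.42</p>\n   <p>2024-03-02T18:40:12</p></body></html>"
)

same_structure = structural_hash(page_a) == structural_hash(page_b)
same_raw = sha256(page_a).hexdigest() == sha256(page_b).hexdigest()
```

Here same_raw is False while same_structure is True. Note that whitespace runs are collapsed to a single space rather than removed, so two variants only converge if whitespace appears at the same positions in both.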
3. SimHash locality-sensitive hashing
Some ISPs take a government-supplied template and rebrand it — change the logo, add their own contact information, adjust the color scheme. The result is a page that is clearly the same block page conceptually, but different enough in raw content that neither exact nor structural matching fires. SimHash handles these variants by finding pages that are 80% or more similar rather than identical.
SimHash works by computing a weighted sum of feature hashes across the document, then comparing the resulting bit strings. Two similar documents produce bit strings that differ in few positions — the Hamming distance between them is small. For block page detection, we use 64-bit SimHashes computed over 3-gram shingles of the normalized HTML:
from simhash import Simhash

def compute_simhash(body: bytes) -> int:
    """Compute 64-bit SimHash of normalized body."""
    normalized = normalize_html(body).decode('utf-8', errors='replace')
    # 3-gram character shingles provide better discrimination than token-level
    shingles = [normalized[i:i+3] for i in range(len(normalized) - 2)]
    return Simhash(shingles, f=64).value

def hamming_similarity(h1: int, h2: int, bits: int = 64) -> float:
    """Return 0.0–1.0 similarity from Hamming distance."""
    xor = h1 ^ h2
    distance = bin(xor).count('1')
    return 1.0 - (distance / bits)

The SimHash lookup is the most expensive step in the pipeline. A naive scan of all SimHash entries would require computing Hamming distance against every entry in the library, which is impractical at query time. We use a banding technique: the 64-bit hash is split into 8 bands of 8 bits each. For each band, we index the band value. A candidate match must share at least one band with the query hash, which limits the scan to a small fraction of the library for most queries.
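The banding idea can be sketched as an in-memory inverted index. This is an illustrative stand-in, not the production index: the BandIndex class, its method names, and the BP-XX ids are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

BITS = 64
BANDS = 8
BAND_WIDTH = BITS // BANDS  # 8 bits per band


def bands(h: int) -> List[Tuple[int, int]]:
    """Split a 64-bit hash into (band_index, band_value) pairs."""
    mask = (1 << BAND_WIDTH) - 1
    return [(i, (h >> (i * BAND_WIDTH)) & mask) for i in range(BANDS)]


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


class BandIndex:
    """In-memory inverted index from band values to fingerprint ids."""

    def __init__(self) -> None:
        self._buckets: Dict[Tuple[int, int], Set[str]] = defaultdict(set)
        self._hashes: Dict[str, int] = {}

    def add(self, fp_id: str, h: int) -> None:
        self._hashes[fp_id] = h
        for key in bands(h):
            self._buckets[key].add(fp_id)

    def candidates(self, query: int) -> Set[str]:
        """Entries sharing at least one identical band with the query."""
        found: Set[str] = set()
        for key in bands(query):
            found |= self._buckets.get(key, set())
        return found

    def nearest(self, query: int, max_distance: int) -> List[Tuple[str, int]]:
        """Candidates within max_distance bits, closest first."""
        scored = [
            (fp_id, hamming_distance(query, self._hashes[fp_id]))
            for fp_id in self.candidates(query)
        ]
        return sorted((s for s in scored if s[1] <= max_distance), key=lambda s: s[1])


index = BandIndex()
stored = 0x0123456789ABCDEF
index.add("BP-XX-001", stored)                       # hypothetical entry
index.add("BP-XX-002", ~stored & ((1 << BITS) - 1))  # every bit flipped: shares no band
query = stored ^ 0b111                               # three flipped bits, all in band 0
```

One caveat worth keeping in mind: by the pigeonhole principle, a shared band is only guaranteed when fewer than eight bits differ. At larger distances all eight bands can each contain a differing bit, so a band filter of this shape can miss a true neighbour; how the production index treats that range is outside what this sketch shows.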
We further narrow the SimHash candidate search by filtering on country_code before computing Hamming distance. A probe in Turkey is extremely unlikely to encounter an Iranian block page; the country filter reduces candidate set size by roughly 96% in the common case.
SimHash entries use a similarity_threshold field to record the threshold at which that specific fingerprint was validated. Most are set at 0.85 (ISP template variants). A handful of entries for countries where the template is highly consistent are set at 0.90 or 0.92. Matches at or above the stored threshold carry a score equal to the measured similarity.
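Because similarity is defined as 1 minus the Hamming distance divided by the hash width, each stored threshold corresponds to a maximum number of differing bits. The helper below works that out for a 64-bit hash; the function name is illustrative.

```python
def max_hamming_distance(threshold: float, bits: int = 64) -> int:
    """Largest bit distance whose similarity (1 - d/bits) still meets the threshold."""
    return max(d for d in range(bits + 1) if 1.0 - d / bits >= threshold)


# Thresholds mentioned in the text, plus the 0.80 floor used for candidate search.
distance_budget = {t: max_hamming_distance(t) for t in (0.80, 0.85, 0.90, 0.92, 1.0)}
```

For 64-bit hashes this gives 12 bits at 0.80, 9 bits at 0.85, 6 bits at 0.90, 5 bits at 0.92, and 0 bits at 1.0. The step from 0.85 to 0.92 therefore roughly halves the number of bits a variant is allowed to differ by.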
4. TLS certificate fingerprinting
A minority of block pages are served over HTTPS with self-signed or operator-issued certificates. When an ISP intercepts traffic to redirect it to a block page, it must present a TLS certificate for the domain being blocked. Some ISPs use a self-signed certificate with a consistent subject distinguished name or a consistent public key fingerprint across all their block page deployments.
TLS fingerprint entries store the SHA-256 fingerprint of the leaf certificate rather than the body hash. They are matched at the TLS comparison stage — before any HTTP fetch — by comparing the certificate the probe received against the TLS certificate entries in the library for that country. This catches cases where the block page content itself varies but the certificate is constant.
from hashlib import sha256
from typing import Optional

# `db` is the fingerprint store client (not shown).
def check_tls_fingerprint(
    cert_der: bytes,
    country_code: str,
    asn: Optional[int]
) -> Optional[BlockPageEntry]:
    cert_fp = sha256(cert_der).hexdigest()
    return db.query_one(
        """
        SELECT * FROM block_page_fingerprints
        WHERE method = 'tls_cert'
          AND hash_value = ?
          AND country_code = ?
          AND retired_date IS NULL
          AND (asn IS NULL OR asn = ?)
        LIMIT 1
        """,
        (cert_fp, country_code, asn)
    )

TLS fingerprint entries currently account for about 4% of the library. They are disproportionately valuable because they fire early in the pipeline: a TLS certificate match short-circuits the rest of the comparison and flags the measurement as a block before an HTTP body is even fetched.
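For readers who want to experiment with certificate fingerprints, the sketch below shows one way to obtain a leaf certificate in DER form with Python's standard library and match its SHA-256 digest against an in-memory table. The TLS_ENTRIES dictionary and match_tls function are hypothetical stand-ins for the database query above, and how Voidly probes actually capture certificates is not described in this article.

```python
import socket
import ssl
from hashlib import sha256
from typing import Dict, Optional, Tuple


def cert_fingerprint(cert_der: bytes) -> str:
    """SHA-256 hex digest of a DER-encoded certificate."""
    return sha256(cert_der).hexdigest()


def fetch_leaf_cert_der(host: str, port: int = 443, timeout: float = 5.0) -> Optional[bytes]:
    """Fetch the leaf certificate without validating it.

    Validation is disabled on purpose: an interception certificate is
    typically self-signed and would fail a normal handshake.
    """
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    with socket.create_connection((host, port), timeout=timeout) as raw:
        with ctx.wrap_socket(raw, server_hostname=host) as tls:
            return tls.getpeercert(binary_form=True)


# Hypothetical in-memory stand-in for the tls_cert rows:
# (country_code, fingerprint) -> (fp_id, asn or None)
TLS_ENTRIES: Dict[Tuple[str, str], Tuple[str, Optional[int]]] = {}


def match_tls(cert_der: bytes, country_code: str, asn: Optional[int]) -> Optional[str]:
    """Return the matching fp_id, honouring the ASN scoping rule."""
    entry = TLS_ENTRIES.get((country_code, cert_fingerprint(cert_der)))
    if entry is None:
        return None
    fp_id, entry_asn = entry
    # An entry with asn=None applies to every carrier in the country.
    if entry_asn is None or entry_asn == asn:
        return fp_id
    return None
```

The fingerprint covers the whole certificate, so an operator that reissues the certificate with the same key would produce a new fingerprint; matching on the public key alone would need a different hash input.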
The match pipeline
The three body-based strategies are applied in order, fastest first; the TLS certificate check runs earlier, at the TLS comparison stage. Most matches are found in the first two steps; SimHash and TLS certificate lookups handle edge cases:
from hashlib import sha256
from dataclasses import dataclass
from typing import Optional

# `db` is the fingerprint store client (not shown); BlockPageEntry, normalize_html,
# and compute_simhash are defined in the sections above.

@dataclass
class BlockPageMatch:
    entry: BlockPageEntry
    score: float    # 0.0–1.0
    method: str     # which strategy matched

def check_blockpage(
    response_body: bytes,
    country_code: str,
    asn: Optional[int]
) -> Optional[BlockPageMatch]:
    body_hash = sha256(response_body).hexdigest()

    # 1. Exact hash (fast path: O(1) index lookup)
    if entry := db.query_exact(body_hash, method='exact_hash'):
        db.increment_incident_count(entry.fp_id)
        return BlockPageMatch(entry, score=1.0, method='exact')

    # 2. Structural match (normalize dynamic fields, then exact hash)
    normalized = normalize_html(response_body)
    norm_hash = sha256(normalized).hexdigest()
    if entry := db.query_exact(norm_hash, method='structural'):
        db.increment_incident_count(entry.fp_id)
        return BlockPageMatch(entry, score=0.97, method='structural')

    # 3. SimHash: narrow to same country to limit candidate set
    probe_sh = compute_simhash(response_body)
    candidates = db.query_simhash_near(
        simhash=probe_sh,
        country_code=country_code,
        asn=asn,
        min_threshold=0.80,
    )
    if candidates:
        best = max(candidates, key=lambda c: c.similarity)
        if best.similarity >= best.entry.similarity_threshold:
            db.increment_incident_count(best.entry.fp_id)
            return BlockPageMatch(best.entry, score=best.similarity, method='simhash')

    # 4. TLS cert fingerprint is checked separately at TLS comparison stage.
    #    By this point in check_blockpage(), TLS has already been evaluated;
    #    returning None here means no HTTP-body match was found.
    return None

The pipeline returns the first match found, not the best possible match across all strategies. In practice, the ordering corresponds to confidence: an exact hash match is always more certain than a structural match, which is always more certain than an 87% SimHash similarity. Returning early on the exact match avoids the cost of running the slower strategies when they aren't needed.
When a match fires, the incident_count field on the matched entry is incremented asynchronously (a Cloudflare D1 write queued via the Worker). This count is used in the weekly audit to identify stale fingerprints.
How block pages are collected
The library grows through three primary channels:
OONI confirmed block pages. OONI's published dataset includes a confirmed flag on measurements where the response body was matched against OONI's own block-page corpus. When we ingested the OONI historical corpus, we extracted every unique body hash from confirmed-flagged measurements, deduped them, and bulk-imported them as source = 'ooni_confirmed' entries. This seeded the library with about 1,400 signatures covering the period from 2015 through 2023.
CensoredPlanet HTTP request data. CensoredPlanet's Hyperquack and Quack datasets include HTTP response bodies for measurements that diverged from controls. We run these through our normalization and hashing pipeline and cross-reference them against incidents confirmed through other means (government announcements, news reporting, OONI corroboration). Bodies that appear in multiple confirmed incidents from the same country and ASN are submitted to the fingerprint pipeline as source = 'censored_planet'.
Direct probe captures. When a Voidly probe measurement diverges from the control and no fingerprint fires, but the body length ratio and HTTP status suggest a block (ratio < 0.15, status in {403, 451, 302}), the raw body is queued for analyst review. The pipeline automatically assigns a provisional http_block_soft flag to the measurement while the body is under review. If the analyst confirms the body is a genuine block page, a new fingerprint entry is created with source = 'probe_capture'.
Every new ISP-specific fingerprint — the first time we see a carrier's block page — requires manual analyst confirmation before it enters the active library. Subsequent variants of a confirmed fingerprint from the same country can be added automatically if they exceed the SimHash similarity threshold for the confirmed entry. This semi-automated expansion accounts for most of the ~50 new signatures added per month.
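The triage rule for direct probe captures can be written down as a small predicate. This is a sketch under stated assumptions: the text does not define the ratio precisely, so it is taken here as probe body length divided by control body length, and the HttpObservation type and function name are hypothetical.

```python
from dataclasses import dataclass

SUSPICIOUS_STATUSES = frozenset({302, 403, 451})
MAX_LENGTH_RATIO = 0.15


@dataclass
class HttpObservation:
    status: int
    body: bytes


def should_queue_for_review(
    probe: HttpObservation,
    control: HttpObservation,
    fingerprint_matched: bool,
) -> bool:
    """Decide whether an unmatched, divergent body goes to analyst review."""
    if fingerprint_matched:
        return False  # already explained by a known fingerprint
    if probe.body == control.body:
        return False  # no divergence from the control
    if not control.body:
        return False  # ratio is undefined for an empty control body
    ratio = len(probe.body) / len(control.body)
    return ratio < MAX_LENGTH_RATIO and probe.status in SUSPICIOUS_STATUSES
```

Both conditions must hold, so a short body served with 200 OK is not queued by this rule even though such pages can be block pages; those cases depend on the fingerprint library having seen the page before.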
False positive sources and mitigations
Three categories of legitimate pages produce bodies that could be mistaken for block pages without careful mitigation:
CDN error pages. Cloudflare's 1020 (Access Denied) page and AWS's 503 service unavailable page are short, recognizable HTML bodies that appear whenever a CDN's bot-detection or rate-limiting fires. If these were fingerprinted as block pages, any probe getting a routine Cloudflare challenge would be flagged as a censorship event. We maintain an explicit allow-list of CDN error page hashes; entries in the allow-list are checked before the block-page library and cause the measurement to be classified as http_error (legitimate) rather than http_block.
Captcha pages. Many sites serve interactive challenges (Google reCAPTCHA, hCaptcha, Cloudflare Turnstile) when they suspect automated access. A probe receiving a captcha sees a divergent body from the control, but this is not censorship. Captcha bodies are identified by known challenge provider signatures in the HTML and classified as http_block_soft, a soft block indicating access limitation rather than outright censorship. The http_block_soft field is included in the Voidly dataset and is distinct from the hard blockpage_match field.
ISP login portals. Hotel, airport, and campus Wi-Fi networks frequently intercept HTTP traffic and redirect it to a login or payment portal. A probe running on such a network would see divergent bodies for every domain it tests. This is detected by checking whether multiple topically unrelated domains return the same body hash: if ten different domains all return identical bodies, the probe is behind a captive portal, not experiencing selective censorship. Probe sessions that trigger this heuristic are marked captive_portal_detected: true and excluded from the fingerprint pipeline for that session.
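The captive-portal heuristic from the last category can be sketched as follows. The function name and the input shape (a mapping from tested domain to response body for one probe session) are assumptions, and the sketch does not attempt the "topically unrelated" check, which would need domain category data.

```python
from collections import Counter
from hashlib import sha256
from typing import Dict


def detect_captive_portal(bodies_by_domain: Dict[str, bytes], min_domains: int = 10) -> bool:
    """True when at least min_domains distinct domains returned an identical body."""
    if not bodies_by_domain:
        return False
    counts = Counter(sha256(body).hexdigest() for body in bodies_by_domain.values())
    most_common_count = counts.most_common(1)[0][1]
    return most_common_count >= min_domains
```

A session-level flag like this is evaluated once per probe session rather than per measurement, which is why a positive result excludes the whole session from fingerprint matching.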
Per-country library composition
The library is not evenly distributed across countries. Countries that censor more, and whose censors return block pages rather than silently dropping traffic, account for the bulk of the signatures.
Turkey (TR) — 47 signatures. Turkey's BTK issues blocking orders to ISPs under Law No. 5651. The BTK provides a reference block page template, but each of Turkey's major carriers (Türk Telekom ASN 9121, Vodafone Turkey ASN 15897, Turkcell ASN 34984) has customized it. We have 47 variants in the library: the base BTK template and 46 carrier-specific modifications identified through SimHash clustering of probe captures. The Türk Telekom variant is the most common, accounting for roughly 60% of Turkey block-page matches by incident volume.
Iran (IR) — 312 signatures. Iran has the most diverse block page corpus in the library, reflecting a fragmented censorship infrastructure where each ISP operates its own filtering system. FATA (Cyber Police) produces a “403 Forbidden” family of pages that is recognizable across carriers, but individual ISPs add significant variation. The 312 Iranian signatures include FATA templates, ISP-specific variants, and several pages from the Ministry of Culture and Islamic Guidance that appear on media and news content specifically.
Russia (RU) — 189 signatures. Roskomnadzor operates Russia's centralized filtering system (TSPU, introduced under the “Sovereign Internet” law). The TSPU produces a standard block page template, but Russian ISPs also maintain their own lists and serve their own pages for operator-determined blocks. The 189 signatures cover both the Roskomnadzor TSPU template and operator-specific variants. The signature count grew significantly during 2022 as new categories of content were blocked following legislative changes.
China (CN) — 8 signatures. China's Great Firewall primarily uses TCP RST injection and DNS NXDOMAIN — it rarely returns block pages. The 8 signatures we have come from WeChat's in-app browser content moderation system (which produces a specific block notice for flagged URLs opened within WeChat) and from a small number of provincial ISPs that use block pages for a narrow category of content. GFW-level blocks in China are almost never detected through block-page fingerprinting; they require TCP or DNS analysis.
Germany (DE) — 12 signatures. Germany's FSM (Freiwillige Selbstkontrolle Multimedia-Diensteanbieter) operates an age-verification and content-filtering framework for adult material. German ISPs that participate serve age-gate and block pages for content in restricted categories. These 12 signatures represent legitimate regulatory blocking rather than political censorship, and the measurement dataset annotates them with a block_reason: age_restriction field in addition to the standard fingerprint fields. They are not excluded from the dataset but are flagged so researchers can distinguish regulatory filtering from political suppression.
Integration with the classifier
Block page fingerprint results enter the classifier pipeline through two fields that appear in every Voidly measurement record. The blockpage_match boolean indicates whether any fingerprint fired for that measurement. The blockpage_fp_id string, when present, records which specific fingerprint matched (e.g., BP-TR-003 for the Türk Telekom variant of the BTK template).
In the ML training pipeline, blockpage_match = True is used as a weak supervision label via the Snorkel label function lf_http_blockpage_hash. The label function has a weight of 0.97: near-certain, but not 1.0, because in very rare cases a block page fingerprint can fire against a legitimate page that happens to use similar template HTML (we have observed this once with a public-sector site in Finland that inadvertently matched an Estonian government block-page structural hash).
# Snorkel label function used in ML training pipeline
from snorkel.labeling import labeling_function

# Label constants are defined by the pipeline, not exported by Snorkel.
# ABSTAIN = -1 follows Snorkel's convention; the other two values are illustrative.
ABSTAIN = -1
ACCESSIBLE = 0
BLOCK = 1

@labeling_function()
def lf_http_blockpage_hash(row) -> int:
    """
    Near-certain HTTP block indicator: response body matches a known block page fingerprint.
    Weight: 0.97. High confidence, not 1.0, to account for the rare false positive.
    """
    if row.blockpage_match is True and row.blockpage_fp_id is not None:
        return BLOCK
    return ABSTAIN

The blockpage_fp_id field also allows downstream analysis beyond binary classification. A researcher querying the Voidly dataset can group by blockpage_fp_id to see how many measurements matched a specific ISP's block page template, or track when a carrier changed templates (a shift in matched fp_id across time for the same country and ASN).
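As an example of that kind of downstream analysis, the sketch below reports each point at which the matched fingerprint changes for a given country and ASN. The Match record and its day field are hypothetical; country_code, asn, and blockpage_fp_id follow the field names used in this article, and the sample rows are invented.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple, Tuple


class Match(NamedTuple):
    day: str               # ISO date of the measurement (hypothetical field name)
    country_code: str
    asn: int
    blockpage_fp_id: str


def template_changes(
    matches: Iterable[Match],
) -> Dict[Tuple[str, int], List[Tuple[str, str, str]]]:
    """Map (country_code, asn) to a list of (day, old_fp_id, new_fp_id) transitions."""
    changes: Dict[Tuple[str, int], List[Tuple[str, str, str]]] = defaultdict(list)
    last_seen: Dict[Tuple[str, int], str] = {}
    for m in sorted(matches, key=lambda m: m.day):
        key = (m.country_code, m.asn)
        previous = last_seen.get(key)
        if previous is not None and previous != m.blockpage_fp_id:
            changes[key].append((m.day, previous, m.blockpage_fp_id))
        last_seen[key] = m.blockpage_fp_id
    return dict(changes)


# Invented sample rows for illustration.
sample = [
    Match("2024-01-01", "TR", 9121, "BP-TR-003"),
    Match("2024-01-02", "TR", 9121, "BP-TR-003"),
    Match("2024-01-03", "TR", 9121, "BP-TR-011"),
    Match("2024-01-01", "TR", 15897, "BP-TR-020"),
    Match("2024-01-03", "TR", 15897, "BP-TR-020"),
]
transitions = template_changes(sample)
```

A carrier that serves several templates concurrently would show up here as repeated back-and-forth transitions, so in practice the input would probably need to be aggregated to a dominant fingerprint per day first.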
Update cadence and library maintenance
The fingerprint library is a living dataset. Censors update their templates, new ISPs start enforcing blocks, and old block pages go out of rotation. We maintain it through three update cycles:
Continuous additions. New signatures are added within 48 hours of a new block page body first appearing in a confirmed incident. Confirmed means corroborated: either the block-page probe capture is from a country and domain where OONI or CensoredPlanet independently flags interference in the same 24-hour window, or the domain is confirmed blocked by official government announcement. We do not add fingerprints from single probe captures alone.
Weekly audit. Every Monday, a script queries all active fingerprints with incident_count = 0 in the trailing 180 days (6 months). These candidates are reviewed: if no active blocking incident in that country uses the fingerprint, it is retired by setting retired_date to the current date. Retired entries remain in the database but are excluded from active matching. This prevents the library from accumulating stale fingerprints from ISPs that have changed their templates, which would waste compute on SimHash candidates that never match.
Emergency additions. Major shutdown events — internet blackouts accompanying political crises — often introduce new block pages at scale as censors respond to rapidly changing content. During the Myanmar military coup in 2021 and the Iranian protests following Mahsa Amini's death in 2022, we ran expedited addition pipelines: probe captures from those countries were prioritized for analyst review within hours, and confirmed fingerprints were added within the same day rather than waiting 48 hours. The emergency pipeline bypasses the 48-hour window but still requires corroboration from at least one external source.
The library currently grows at roughly 50 new signatures per month under the standard cadence. Emergency periods contribute bursts of 30–80 signatures over a few days. The 2,300-entry count reflects cumulative active entries; the total database including retired entries is approximately 2,900.
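The candidate selection step of the weekly audit can be sketched in a few lines. Two things here are assumptions rather than documented behaviour: the schema shown earlier has no per-entry last-match timestamp, so last_match_date is a hypothetical field, and entries that have never matched are measured from their added_date so that recently added fingerprints are not flagged immediately.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Iterable, List, Optional

STALE_AFTER = timedelta(days=180)


@dataclass
class FingerprintActivity:
    fp_id: str
    added_date: date
    last_match_date: Optional[date]      # hypothetical field; not in the schema above
    retired_date: Optional[date] = None  # None = active


def audit_candidates(entries: Iterable[FingerprintActivity], today: date) -> List[str]:
    """Return active fp_ids with no match activity in the trailing 180 days."""
    cutoff = today - STALE_AFTER
    stale = []
    for entry in entries:
        if entry.retired_date is not None:
            continue  # already retired, nothing to review
        last_activity = entry.last_match_date or entry.added_date
        if last_activity < cutoff:
            stale.append(entry.fp_id)
    return stale
```

The function only nominates candidates. Retirement itself stays a separate, reviewed step, as described above.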
Related articles
For how the control server consults this library during the HTTP comparison stage: The Voidly control server: how we tell censorship from a bad network →
For the lf_http_blockpage_hash label function and how block page matches feed weak supervision: The Voidly anomaly classifier: five interference classes and why we optimize for recall →
For how block page match labels are used as weak supervision signals in classifier training: Voidly's ML training pipeline: building a labeled censorship dataset from OONI measurements →
For the blockpage_match and blockpage_fp_id field definitions in the published dataset: The Voidly measurement dataset: field-by-field schema reference →
For how the DNS layer detects censorship before any HTTP block page can appear — NXDOMAIN injection, IP spoofing, and resolver-level filtering: How Voidly measures DNS censorship: NXDOMAIN injection, IP spoofing, and resolver-level filtering →