Technical writing

Voidly's measurement retention policy: hot, warm, and cold tiers for 2.2B probe results

March 5, 2025· 14 min read· AI Analytics

CensorshipVoidlyInfrastructureData Engineering

The TimescaleDB setup article mentioned a three-tier retention policy in passing. This article is the full technical account of how that policy works, why the thresholds are where they are, and what happens to measurement data at each stage of its life. The short version: 2.2 billion measurements are kept at full resolution for 30 days, compressed for another 335 days, then reduced to daily aggregates forever — with continuous aggregates surviving the raw data deletion so that long-range trend queries remain fast.

The retention problem

Voidly's probe network generates roughly 47 million measurements per day. Each raw measurement row is approximately 120 bytes uncompressed, which puts daily ingest at around 5.6 GB of new uncompressed data. Over the 2.2 billion measurements accumulated to date, that totals approximately 264 GB of raw data — before accounting for indexes, TOAST storage for variable-length fields, and TimescaleDB's per-chunk metadata.

Without a retention policy, two problems compound over time. First, storage costs scale linearly with data volume: at 5.6 GB/day uncompressed, a five-year dataset would require roughly 10 TB of raw PostgreSQL storage. Second, query performance degrades on the oldest data because index scans across five years of 1-day chunks require scanning hundreds of chunks even for a tightly bounded time range, and PostgreSQL's planner overhead grows with chunk count regardless of pruning effectiveness.

The three-tier policy addresses both. Hot data (0–30 days) is stored at full resolution in uncompressed hypertable chunks and is what the real-time pipeline reads from. Warm data (31–365 days) is stored in the same hypertable but in compressed, read-only chunks that achieve a 6.2× size reduction. Cold data (older than 365 days) exists only as daily aggregate summaries — the raw measurement rows are deleted entirely. The combination keeps the live TimescaleDB instance at a manageable size while preserving full historical trend visibility through the aggregate layer.

Hot tier — full resolution in the active hypertable

The most recent 30 days of measurements are stored at full resolution: one row per measurement, all fields populated, no compression. TimescaleDB creates 1-day chunks automatically based on the hypertable's chunk_time_interval setting. At 47 million rows per day and approximately 120 bytes per row, each day's chunk is roughly 4.8 GB uncompressed before any index overhead.

All fields are available in the hot tier, including the raw HTTP body hash (http_body_sha256), TLS certificate fingerprint fields (tls_cert_valid, tls_cert_sni, tls_alert_code), and per-layer timing such as tcp_connect_ms. These fields are critical for the anomaly classifier, which needs body hashes to detect blockpage substitution and certificate fingerprints to detect TLS interception — both of which are discarded in the warm and cold tiers where row-level detail is not preserved.

Query latency against hot-tier data is p50 12 ms and p99 180 ms for single-country, single-day queries. The real-time pipeline that drives anomaly detection reads exclusively from the hot tier and relies on this latency profile for sub-minute detection windows.

Warm tier — native compression

Measurements older than 30 days are moved to the warm tier via TimescaleDB's native columnar compression. The compression configuration specifies which columns to use as segment keys (grouping rows with the same values for maximum dictionary compression efficiency) and the sort order within each segment:

ALTER TABLE measurements SET (
  timescaledb.compress,
  timescaledb.compress_segmentby = 'country_code, probe_type',
  timescaledb.compress_orderby = 'measured_at DESC, domain'
);

SELECT add_compression_policy('measurements', INTERVAL '30 days');

The compress_segmentby choice of country_code andprobe_type groups rows from the same country and protocol type into the same column store segment. Because country_code has only around 200 distinct values and probe_type has five, dictionary encoding achieves extremely high ratios on these columns (40× and 50× respectively). The compress_orderby of measured_at DESC, domain ensures that delta encoding on timestamps is maximally effective — consecutive timestamps in the same chunk differ by small amounts — and that domain lookups within a segment benefit from sorted order.

The overall compression ratio measured across 90 days of production data is 6.2×. Compressed chunks are read-only from PostgreSQL's perspective; queries against warm data decompress the relevant column segments on the fly into a temporary buffer. This decompression overhead raises query latency relative to the hot tier: p50 45 ms and p99 820 ms for the same single-country, single-day query pattern. The real-time pipeline never queries warm data — it reads only from the hot tier — so this latency is only visible to ad-hoc historical queries and the continuous aggregate refresh worker.

Cold tier — downsampled aggregates only

Measurements older than 365 days are represented exclusively by their daily aggregate summaries in the measurement_daily_summary continuous aggregate. The raw measurement rows — including body hashes, certificate fingerprints, per-layer timings, and individual probe results — are permanently deleted by TimescaleDB's retention policy:

SELECT add_retention_policy('measurements', INTERVAL '365 days');

The daily aggregate that survives includes: blocking_rate (fraction of measurements with interference_type != 'none'), probe_count(distinct probes that contributed), interference_type distribution (a JSONB breakdown across the seven interference categories), p_blocked median (median interference probability across all measurements for that day), andconfidence_weighted_avg (average interference probability weighted by each measurement's confidence tier). What it does not include: any per-measurement fields such as body_sha256 or certificate_fingerprint — these are gone from the live database after the 365-day window.

The data reduction is substantial. Each day in the hot or warm tier contributes 47 million rows. Each day in the cold tier contributes approximately one row per (country, domain, probe_type_group) triple. At the scale of Voidly's test list (roughly 50,000 domains tested across 70 countries per probe type), that is approximately 3.5 million aggregate rows per day versus 47 million raw rows — a reduction of about 13× at the row level. Across the full cold tier (all data older than 365 days), the aggregate layer holds roughly one row per (country, domain, probe_type_group) per day, compared to 47,000 raw measurements per day for that combination on average. The effective row-count reduction for data older than one year is approximately 47,000×.

Continuous aggregate cascade

The retention policy works because continuous aggregates are independent TimescaleDB objects — they survive raw data deletion and are refreshed from the raw data while it exists. Three cascade levels ensure that no matter how old the data, some form of aggregated view remains queryable:

-- Level 1: hourly summaries — retained 90 days, then dropped
CREATE MATERIALIZED VIEW measurement_hourly_summary
WITH (timescaledb.continuous) AS
SELECT
    time_bucket('1 hour', measured_at)          AS bucket,
    country_code,
    domain,
    probe_type,
    COUNT(*)                                    AS measurement_count,
    AVG(interference_prob)                      AS p_blocked_avg,
    COUNT(*) FILTER (WHERE interference_type != 'none')
                                                AS interference_count,
    jsonb_object_agg(
        interference_type, type_count
    ) FILTER (WHERE interference_type IS NOT NULL)
                                                AS interference_type_dist
FROM measurements,
     LATERAL (
         SELECT interference_type,
                COUNT(*) AS type_count
         FROM measurements m2
         WHERE m2.measured_at = measurements.measured_at
         GROUP BY interference_type
     ) type_counts
WHERE inference_dropped IS NULL
GROUP BY 1, 2, 3, 4;

SELECT add_retention_policy('measurement_hourly_summary', INTERVAL '90 days');

-- Level 2: daily summaries — retained indefinitely
CREATE MATERIALIZED VIEW measurement_daily_summary
WITH (timescaledb.continuous) AS
SELECT
    time_bucket('1 day', measured_at)           AS day,
    country_code,
    domain,
    probe_type,
    SUM(measurement_count)                      AS measurement_count,
    SUM(interference_count)::float /
        NULLIF(SUM(measurement_count), 0)       AS blocking_rate,
    AVG(p_blocked_avg)                          AS p_blocked_median,
    COUNT(DISTINCT probe_id)                    AS probe_count,
    AVG(interference_prob * confidence_weight)
        / NULLIF(AVG(confidence_weight), 0)     AS confidence_weighted_avg,
    jsonb_agg(interference_type_dist)           AS interference_type_distribution
FROM measurement_hourly_summary
GROUP BY 1, 2, 3, 4;

-- No retention policy — this view is permanent

-- Level 3: monthly country rollups — retained indefinitely
CREATE MATERIALIZED VIEW country_monthly_summary
WITH (timescaledb.continuous) AS
SELECT
    time_bucket('1 month', day)                 AS month,
    country_code,
    SUM(measurement_count)                      AS total_measurements,
    SUM(measurement_count * blocking_rate)::float /
        NULLIF(SUM(measurement_count), 0)       AS global_blocking_rate,
    COUNT(DISTINCT domain)                      AS distinct_domains_tested
FROM measurement_daily_summary
GROUP BY 1, 2;

-- No retention policy — this view is permanent

The cascade means that as data ages: within 90 days, hourly granularity is available; beyond 90 days, only daily granularity; beyond 365 days, only the daily and monthly aggregates from Level 2 and Level 3. Trend analysis going back years remains fast because the country_monthly_summary has at most a few thousand rows per country — the full five-year monthly history for all 200 countries fits in a few hundred thousand rows total.

Storage accounting

With the three-tier policy in place, the live TimescaleDB instance holds the following approximate storage distribution:

Tier                     Days covered    Approx size    Growth rate
──────────────────────────────────────────────────────────────────────
Hot (uncompressed)       0–30            ~144 GB        4.8 GB/day
Warm (compressed 6.2x)   31–365          ~55 GB         Constant (365-day window)
Cold (aggregates only)   >365            ~12 GB         ~1 GB/year
──────────────────────────────────────────────────────────────────────
Total                                    ~211 GB        ~4.8 GB/day (hot only)

The hot tier grows at 4.8 GB/day but also drops a 30-day-old chunk every day, so its steady-state size is constant at approximately 144 GB (30 days × 4.8 GB/day). The warm tier is similarly steady-state: it holds a rolling 335-day window of compressed data. The cold tier grows slowly — daily aggregate rows are small and accumulate at roughly 1 GB per year across all country-domain-probe_type combinations. The total live database footprint is therefore bounded at approximately 211 GB plus index overhead, regardless of how long Voidly has been running.

The compression job schedule

TimescaleDB compression is executed by a background worker process that runs on a schedule set by the compression policy. The policy checks every hour for chunks eligible for compression — specifically, chunks whose range_end is more than 30 days in the past and that are not yet marked is_compressed. Under normal operation this means one chunk is compressed per day, shortly after it ages past the 30-day threshold.

During compression a chunk is briefly locked for reads. The lock window is 2–3 seconds for a typical 4.8 GB chunk on the production instance (an r6g.2xlarge with NVMe-backed EBS). The real-time pipeline handles this via retry with exponential backoff: the first retry is attempted after 200 ms, subsequent retries after 400 ms, 800 ms, and so on up to a 10-second cap. In practice the compression lock never exceeds 3 seconds, so at most two retries are needed. No measurements are lost during compression — the retry logic reads from an in-memory buffer that accumulates probe events during the backoff window.

Compression can also be triggered manually for operational reasons (for example, before a planned maintenance window to reduce the data volume that needs to be backed up):

-- Manually compress all eligible chunks immediately
SELECT compress_chunk(c.schema_name || '.' || c.table_name)
FROM timescaledb_information.chunks c
WHERE c.hypertable_name = 'measurements'
  AND c.range_end < NOW() - INTERVAL '30 days'
  AND NOT c.is_compressed;

Verifying retention with pg_cron

The retention policy is critical enough that its correct operation is verified independently of TimescaleDB's own internal job scheduling. A pg_cron job runs daily at 04:00 UTC and checks for two classes of compliance failures: uncompressed chunks that should have been compressed, and unretained chunks older than 365 days that should have been dropped.

-- Check that no uncompressed chunks older than 30 days exist
SELECT chunk_schema, chunk_name, range_start, is_compressed
FROM timescaledb_information.chunks
WHERE hypertable_name = 'measurements'
  AND range_end < NOW() - INTERVAL '30 days'
  AND NOT is_compressed;

-- Check that no raw chunks older than 365 days exist
SELECT chunk_schema, chunk_name, range_start
FROM timescaledb_information.chunks
WHERE hypertable_name = 'measurements'
  AND range_end < NOW() - INTERVAL '365 days';

If either query returns rows, a PagerDuty alert fires with severity P2 (not P1, because storage is not immediately exhausted — a delayed compression or retention job is recoverable). The alert includes the chunk names and their range_starttimestamps so the on-call engineer can immediately identify which chunk failed to process and run the appropriate manual recovery command. In eleven months of production operation, this alert has fired twice — both times due to a stuck TimescaleDB background worker that was resolved by a pg_cancel_backend call against the stalled job process.

Research data and the retention policy

The hot/warm/cold retention policy applies exclusively to the live TimescaleDB instance. It does not affect the public Parquet exports published to HuggingFace and stored on Cloudflare R2 — those are independent files in object storage and are never deleted by the retention policy. The Parquet export pipeline runs nightly and archives raw measurement data permanently; a researcher who needs individual measurement rows for a domain that was tested three years ago will find them in the Parquet dataset even though those rows no longer exist in the live database.

This split means the Voidly API's behavior differs by data age. The/v1/measurements endpoint returns full row-level data for queries within the 365-day window. For queries targeting data older than 365 days, the endpoint returns a structured error directing the caller to the Parquet dataset:

{
  "error": "DATA_NOT_AVAILABLE",
  "message": "Raw measurement data older than 365 days is not available via the API.",
  "resolution": "Use the Parquet dataset on HuggingFace for historical raw measurements.",
  "dataset_url": "https://huggingface.co/datasets/voidly/measurements",
  "aggregate_available": true,
  "aggregate_endpoint": "/v1/aggregate/daily?country=IR&domain=twitter.com&since=2023-01-01"
}

The aggregate_available: true flag and theaggregate_endpoint field tell callers that while raw rows are gone from the live database, the daily aggregate for the requested country and domain is still queryable via the aggregate endpoint. Most API consumers that are doing trend analysis — rather than incident verification — can satisfy their needs from the aggregate endpoint without downloading Parquet files.

Future: tiered storage to object storage

TimescaleDB Cloud's Tiered Storage feature moves compressed chunks transparently to S3-compatible object storage, keeping them queryable via the standard PostgreSQL interface without requiring them to reside on the primary EBS volume. Queries against tiered chunks incur higher latency (typically 200–800 ms per chunk scanned, depending on object storage throughput) but otherwise return correct results through the same SQL interface.

As the Voidly dataset grows toward 10 billion measurements — estimated around Q3 2026 at current growth rates — the warm tier will outgrow what is cost-effective to store on EBS. The plan is to migrate warm-tier compressed chunks to Cloudflare R2 via the TimescaleDB tiered storage adapter. R2 has no egress fees, which is the dominant operational concern: warm-tier chunks are read infrequently (primarily by the continuous aggregate refresh worker and by ad-hoc historical queries) but in large volumes when read.

The cold tier — daily aggregates in the continuous aggregate views — is small enough (~12 GB, growing at ~1 GB/year) that it will remain on the primary PostgreSQL instance indefinitely. The aggregate layer is the part of the system that most needs to be fast for interactive queries, and keeping it local avoids any object storage round-trip latency for the trend analysis queries that are the most common research use case.

For the TimescaleDB setup that this retention policy builds on — hypertable design and continuous aggregates: Voidly's measurement database: 2.2B probe results in TimescaleDB →

For the public Parquet exports that preserve raw data beyond the 365-day retention window: The Voidly Parquet export pipeline: nightly snapshots from TimescaleDB to HuggingFace →

For the probe ingest pipeline that writes measurements into the hot tier: Voidly's probe-to-dataset ingest pipeline: normalization, quality filtering, and TimescaleDB indexing →

For the real-time pipeline that reads from the hot tier for anomaly detection: Voidly's real-time event pipeline: from measurement anomaly to journalist alert in under 8 minutes →