How We Process 2.4M Social Media Posts Per Hour
Our OSINT platform runs 24/7, processing ~58 million social media posts per day. This is a deep dive into the infrastructure that makes it possible.
The Scale
2.4 million posts per hour = 667 posts per second sustained. Peak traffic is 3x that (~2,000 posts/sec) during major events. The system has to handle both the steady state and the spikes.
Each post requires:
- Data extraction and normalization
- Language detection
- Entity recognition (people, places, organizations)
- Sentiment classification
- Duplicate detection
- Storage and indexing
Total processing time: ~60ms per post. At 667 posts/sec, that's 40 CPU-seconds of work per wall-clock second. Parallelization is mandatory.
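Put as a quick back-of-the-envelope, using only the numbers above (a sketch of the sizing arithmetic, not our actual capacity-planning tooling):

# Back-of-the-envelope worker sizing from the figures above
posts_per_sec = 667            # sustained ingest rate
per_post_seconds = 0.060       # end-to-end processing time per post

work_per_second = posts_per_sec * per_post_seconds   # ~40 s of work per wall-clock second
per_worker_rate = 1 / per_post_seconds               # ~16.7 posts/s if a worker runs serially
min_workers = posts_per_sec / per_worker_rate        # ~40 workers just for steady state

print(f"{work_per_second:.0f}s of processing per wall-clock second "
      f"-> at least {min_workers:.0f} workers before any headroom")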
Infrastructure Overview
Scrapers (47 workers)
↓
Kafka (24 partitions, 7-day retention)
↓
NLP Workers (80 GPU instances)
↓
TimescaleDB (primary storage)
↓
Redis (caching + real-time queries)
↓
API Layer (read-only, public facing)

Everything runs on AWS. Total infrastructure cost: $31K/month.
Data Ingestion
We scrape 47 platforms. Mix of official APIs and browser automation.
Twitter: Rate limited to 5,000 posts per 15 minutes per app. We run 40 apps (40 authenticated accounts) in rotation. Each scraper polls every 30 seconds. Gives us ~800K tweets/hour capacity. In practice, we filter to election-relevant keywords and get ~300K/hour.
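The rotation itself is simple round-robin scheduling. A minimal sketch, assuming a fetch_tweets(app, keywords) helper that wraps whatever API client is in use and a publish() helper that pushes into Kafka; the app names and keyword list are placeholders, not our actual scraper code:

import itertools
import time

APPS = [f"app_{i}" for i in range(40)]              # 40 sets of credentials (placeholder names)
POLL_INTERVAL = 30                                  # each app is polled every 30 seconds
KEYWORDS = ["election", "ballot", "polling place"]  # illustrative filter terms

def run_twitter_scrapers(fetch_tweets, publish):
    """Round-robin the 40 apps so each one is polled roughly every 30 seconds."""
    for app in itertools.cycle(APPS):
        posts = fetch_tweets(app, KEYWORDS)    # each app stays under its 5,000 posts / 15 min cap
        publish("social_media_posts", posts)   # straight into Kafka
        time.sleep(POLL_INTERVAL / len(APPS))  # 40 apps x 0.75s = one full rotation every 30s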
Facebook: Official API is worthless (only public pages, heavily rate-limited). We use Playwright to drive headless Chrome. Each scraper logs in as a real user and scrolls feeds. Rotating proxies to avoid bans. Hacky but works. ~150K posts/hour.
Reddit: Excellent API. Using Pushshift dumps for historical data. Real-time via streaming API. ~200K posts/hour.
TikTok: Nightmare to scrape. No official API. Their anti-bot measures are sophisticated. We use residential proxies + device fingerprinting to look like mobile apps. Success rate ~60%. ~100K posts/hour.
Remaining platforms (Instagram, YouTube, Telegram, Discord, forums, etc.): ~1.8M posts/hour combined.
Kafka Configuration
All scrapers push to Kafka. This decouples ingestion from processing—critical for handling spikes.
# Kafka topic configuration
topic: social_media_posts
partitions: 24
replication_factor: 3
retention: 7 days (168 hours)
compression: lz4

# Producer settings (scrapers)
acks: 1               # Leader acknowledges; don't wait for replicas
compression_type: lz4
batch_size: 100KB
linger_ms: 50         # Wait up to 50ms to batch messages

# Consumer settings (NLP workers)
group_id: nlp_workers
auto_offset_reset: earliest
max_poll_records: 500
24 partitions works out to 8 per broker across our 3 brokers (one broker per availability zone), with room to scale the consumer group. Replication factor of 3 means we can lose 2 brokers without data loss.
7-day retention is crucial. When we update NLP models, we replay the last 7 days through the new pipeline. Instant reprocessing of recent data.
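Replaying amounts to pointing a fresh consumer group at week-old offsets. One way that could look with kafka-python; the topic and group names follow the config above, but the helper itself is an illustration, not our actual replay tooling:

import time
from kafka import KafkaConsumer, TopicPartition

TOPIC = "social_media_posts"

def replay_last_days(days=7, bootstrap="kafka:9092"):
    """Start a new consumer group positioned at the offsets closest to `days` ago."""
    consumer = KafkaConsumer(bootstrap_servers=bootstrap,
                             group_id=f"nlp_workers_replay_{int(time.time())}",
                             enable_auto_commit=False)
    partitions = [TopicPartition(TOPIC, p) for p in consumer.partitions_for_topic(TOPIC)]
    consumer.assign(partitions)

    # Map each partition to the first offset at or after the cutoff timestamp (ms)
    cutoff_ms = int((time.time() - days * 86400) * 1000)
    offsets = consumer.offsets_for_times({tp: cutoff_ms for tp in partitions})
    for tp, ot in offsets.items():
        if ot is None:
            consumer.seek_to_end(tp)     # nothing newer than the cutoff on this partition
        else:
            consumer.seek(tp, ot.offset)

    return consumer  # iterate over it to re-feed posts through the new models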
NLP Worker Design
80 workers, each running on g4dn.xlarge (4 vCPU, 16GB RAM, 1 NVIDIA T4 GPU).
Each worker:
import time

import fasttext
import spacy
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

class NLPWorker:
    def __init__(self):
        # Load models once at startup
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

        # Sentiment model (DistilBERT) and its tokenizer
        self.tokenizer = AutoTokenizer.from_pretrained("distilbert-election-sentiment")
        self.sentiment_model = AutoModelForSequenceClassification.from_pretrained(
            "distilbert-election-sentiment"
        ).to(self.device)

        # NER model (spaCy)
        self.ner = spacy.load("en_election_ner")

        # Language detector (fastText)
        self.lang_detector = fasttext.load_model("lid.176.bin")

    def process_batch(self, posts):
        """Process a batch of posts."""
        results = []
        for post in posts:
            # 1. Detect language (~3ms). fastText returns labels like
            #    '__label__en', so strip the prefix before comparing.
            label = self.lang_detector.predict(post.text.replace("\n", " "))[0][0]
            lang = label.replace("__label__", "")
            if lang not in ('en', 'es', 'zh'):
                continue  # Skip non-target languages

            # 2. Extract entities (~12ms)
            doc = self.ner(post.text)
            entities = {
                'people': [e.text for e in doc.ents if e.label_ == 'PERSON'],
                'orgs': [e.text for e in doc.ents if e.label_ == 'ORG'],
                'locations': [e.text for e in doc.ents if e.label_ == 'GPE'],
            }

            # 3. Sentiment classification (~45ms)
            inputs = self.tokenizer(post.text, return_tensors="pt",
                                    truncation=True, max_length=512)
            inputs = {k: v.to(self.device) for k, v in inputs.items()}
            with torch.no_grad():
                outputs = self.sentiment_model(**inputs)
            probs = torch.softmax(outputs.logits, dim=1)[0]
            sentiment = {
                'positive': probs[0].item(),
                'neutral': probs[1].item(),
                'negative': probs[2].item(),
            }

            results.append({
                'post_id': post.id,
                'language': lang,
                'entities': entities,
                'sentiment': sentiment,
                'processed_at': time.time(),
            })
        return results

Batch size is 32 posts. At ~60ms per post, each GPU worker handles ~16 posts/second. 80 workers × 16 posts/sec ≈ 1,280 posts/sec theoretical max. We run at 667 posts/sec, about 52% of capacity.
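The loop above still runs the sentiment model one text at a time; the batching win comes from tokenizing and classifying all 32 texts in a single forward pass. A minimal sketch of that batched step, reusing the tokenizer and model loaded in NLPWorker (a sketch, not the exact production code):

import torch

def classify_sentiment_batch(texts, tokenizer, model, device):
    """Classify a whole batch of texts in one forward pass through the GPU."""
    inputs = tokenizer(texts, return_tensors="pt", padding=True,
                       truncation=True, max_length=512)
    inputs = {k: v.to(device) for k, v in inputs.items()}
    with torch.no_grad():
        logits = model(**inputs).logits        # shape: (batch_size, 3)
    probs = torch.softmax(logits, dim=1)
    return [
        {'positive': p[0].item(), 'neutral': p[1].item(), 'negative': p[2].item()}
        for p in probs
    ]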
Storage Strategy
TimescaleDB for everything. It's PostgreSQL with time-series optimizations.
-- Requires the timescaledb extension (hypertables) and pgvector (vector column / ivfflat index)
CREATE EXTENSION IF NOT EXISTS timescaledb;
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE posts (
time TIMESTAMPTZ NOT NULL,
platform TEXT NOT NULL,
post_id TEXT NOT NULL,
author_id TEXT,
text TEXT,
language TEXT,
sentiment_pos FLOAT,
sentiment_neg FLOAT,
sentiment_neu FLOAT,
entities JSONB,
embedding vector(768), -- Sentence embedding for similarity search
PRIMARY KEY (time, post_id)
);
-- Create hypertable (automatic time-based partitioning)
SELECT create_hypertable('posts', 'time', chunk_time_interval => INTERVAL '1 day');
-- Indexes
CREATE INDEX idx_platform ON posts (platform, time DESC);
CREATE INDEX idx_author ON posts (author_id, time DESC);
CREATE INDEX idx_entities ON posts USING GIN (entities);
CREATE INDEX idx_embedding ON posts USING ivfflat (embedding vector_cosine_ops);
-- Compression (automatically compress data older than 7 days)
ALTER TABLE posts SET (
timescaledb.compress,
timescaledb.compress_segmentby = 'platform',
timescaledb.compress_orderby = 'time DESC'
);
SELECT add_compression_policy('posts', INTERVAL '7 days');

Current database size: 47TB. Query performance is excellent thanks to automatic partitioning. Historical queries ("show all posts mentioning 'candidate X' in October 2023") run in ~200ms.
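As an illustration of that kind of historical query, here is how it could be issued from Python with psycopg2; the JSONB containment operator (@>) lets PostgreSQL use the GIN index on entities. The connection string and candidate name are placeholders:

import json
import psycopg2

# Every post that mentions a given person in October 2023
query = """
    SELECT time, platform, post_id, text
    FROM posts
    WHERE time >= '2023-10-01' AND time < '2023-11-01'
      AND entities @> %s::jsonb
    ORDER BY time DESC
    LIMIT 100;
"""

conn = psycopg2.connect("dbname=osint")  # placeholder connection string
with conn, conn.cursor() as cur:
    cur.execute(query, (json.dumps({"people": ["Candidate X"]}),))
    rows = cur.fetchall()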
Real-Time Queries
PostgreSQL is great for batch analytics but too slow for real-time dashboards. We use Redis for hot data (last 24 hours).
import json

# NLP worker writes to both PostgreSQL and Redis
def store_results(results):
    # Write to PostgreSQL (durable storage)
    db.execute("INSERT INTO posts VALUES (...)", results)

    # Write to Redis (fast queries)
    for result in results:
        post_key = f"post:{result['post_id']}"

        # Sorted set of post IDs per platform, scored by timestamp
        redis.zadd(f"posts:{result['platform']}", {result['post_id']: result['time']})

        # Hash with post details (nested dicts/lists serialized to JSON)
        redis.hset(post_key, mapping={
            k: json.dumps(v) if isinstance(v, (dict, list)) else v
            for k, v in result.items()
        })

        # Expire after 24 hours
        redis.expire(post_key, 86400)

        # Update real-time counters
        redis.hincrby("stats:posts_by_platform", result['platform'], 1)
        redis.hincrby("stats:posts_by_sentiment", result['sentiment_label'], 1)

Dashboard queries hit Redis first and fall back to PostgreSQL for historical data.
Duplicate Detection
Posts get reposted, retweeted, cross-posted. We deduplicate using MinHash LSH.
from datasketch import MinHash, MinHashLSH

# Initialize LSH index: 128-permutation signatures, 0.85 similarity threshold
lsh = MinHashLSH(threshold=0.85, num_perm=128)

# post_id -> original text, kept so matches can report an exact similarity score
_post_texts = {}

def get_text(post_id):
    return _post_texts[post_id]

def calculate_jaccard(text_a, text_b):
    """Exact word-level Jaccard similarity (the quantity MinHash approximates)."""
    a, b = set(text_a.lower().split()), set(text_b.lower().split())
    return len(a & b) / len(a | b)

def check_duplicate(post):
    # Create MinHash signature over the post's words
    m = MinHash(num_perm=128)
    for word in post.text.lower().split():
        m.update(word.encode('utf8'))

    # Check for similar posts already in the index
    similar = lsh.query(m)
    if similar:
        # Duplicate found, store a reference to the original only
        return {
            'is_duplicate': True,
            'original_id': similar[0],
            'similarity': calculate_jaccard(post.text, get_text(similar[0])),
        }

    # New post: add it to the index
    lsh.insert(post.id, m)
    _post_texts[post.id] = post.text
    return {'is_duplicate': False}

# MinHash LSH lookups are effectively O(1), so it keeps up with 2.4M posts/hour

A threshold of 0.85 means two posts need roughly 85%+ estimated Jaccard similarity to be considered duplicates. That catches exact reposts and minor variations ("RT: ..." prefixes, etc.).
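A quick look at the behavior on a retweet-style variant, using throwaway stand-ins for post objects (the texts and IDs are made up):

from types import SimpleNamespace

original = SimpleNamespace(id="t1", text="Polling stations open at 7am across the state")
retweet = SimpleNamespace(id="t2", text="RT: Polling stations open at 7am across the state")

print(check_duplicate(original))  # {'is_duplicate': False}; first sighting, added to the index
print(check_duplicate(retweet))   # flagged as a duplicate of 't1' when the estimate clears 0.85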
Monitoring
We track everything:
- Ingestion rate (posts/sec per platform)
- Processing latency (p50, p95, p99)
- Error rates (failed parses, model errors)
- Queue depth (Kafka lag)
- Resource usage (CPU, GPU, memory per worker)
- Database performance (query times, index sizes)
Metrics go to Prometheus, visualized in Grafana. Alerts via PagerDuty if anything breaks.
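Each worker exposes its metrics over HTTP for Prometheus to scrape. A minimal sketch with prometheus_client; the metric names mirror the example output below, while handle(), worker.classify(), and the port are placeholders rather than our exact instrumentation (the p50/p95/p99 shown below would come from histogram buckets via histogram_quantile):

from prometheus_client import Counter, Histogram, start_http_server

POSTS_PROCESSED = Counter("posts_processed_total", "Posts fully processed", ["platform"])
PROCESSING_TIME = Histogram("processing_duration_seconds", "End-to-end per-post latency")
INFERENCE_TIME = Histogram("model_inference_duration_seconds", "Per-model inference time", ["model"])

start_http_server(9100)  # Prometheus scrapes http://<worker>:9100/metrics

@PROCESSING_TIME.time()
def handle(post, worker):
    with INFERENCE_TIME.labels(model="ner").time():
        doc = worker.ner(post.text)
    with INFERENCE_TIME.labels(model="sentiment").time():
        sentiment = worker.classify(post.text)   # stand-in for the sentiment step above
    POSTS_PROCESSED.labels(platform=post.platform).inc()
    return doc, sentiment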
# Example metrics exposed by NLP workers
posts_processed_total{platform="twitter"} 1847293
posts_processed_total{platform="reddit"} 892847
processing_duration_seconds{quantile="0.5"} 0.058
processing_duration_seconds{quantile="0.95"} 0.143
processing_duration_seconds{quantile="0.99"} 0.287
model_inference_duration_seconds{model="sentiment"} 0.045
model_inference_duration_seconds{model="ner"} 0.012
kafka_consumer_lag{partition="0"} 142
kafka_consumer_lag{partition="1"} 89Cost Breakdown
Monthly AWS bill: $31,247
- 80× g4dn.xlarge (NLP workers): $20,960 (67% of total)
- 3× r5.2xlarge (Kafka brokers): $3,240
- 1× r5.4xlarge (TimescaleDB): $2,419
- 50TB EBS storage: $5,000
- Data transfer: $1,200
- Misc (NAT gateways, load balancers): $428
At ~1.73B posts per month, that's ~$0.000018 per post. Could optimize by using Spot instances (save ~70% on GPU costs) but we value stability over savings.
Lessons Learned
1. Kafka is worth the complexity
Decoupling ingestion from processing saved us multiple times. During peak events, Kafka absorbed spikes while workers caught up. Without it, we'd lose data or need massive overprovisioning.
2. Batch processing is faster than you think
We started with real-time processing (one post at a time). Switching to batches (32 posts) improved throughput 3x due to GPU parallelization.
3. Monitoring prevents outages
We've had zero unplanned downtime in 8 months because we catch issues before they cascade. High Kafka lag? Add workers. Slow queries? Add indexes. Problems are obvious if you measure.
4. Time-series databases are underrated
TimescaleDB turned PostgreSQL into a time-series powerhouse. Automatic partitioning, compression, and retention policies handle data growth without manual intervention.
Infrastructure questions: contact@ai-analytics.org