How We Process 2.4M Social Media Posts Per Hour
Our OSINT platform runs 24/7, processing ~58 million social media posts per day. This is a deep dive into the infrastructure that makes it possible.
The Scale
2.4 million posts per hour = 667 posts per second sustained. Peak traffic is 3x that (~2,000 posts/sec) during major events. The system has to handle both the steady state and the spikes.
Each post requires:
- Data extraction and normalization
- Language detection
- Entity recognition (people, places, organizations)
- Sentiment classification
- Duplicate detection
- Storage and indexing
Total processing time: ~60ms per post. At 667 posts/sec, that's 40 CPU-seconds of work per wall-clock second. Parallelization is mandatory.
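Put as a quick back-of-the-envelope, using only the numbers above (a sketch of the sizing arithmetic, not our actual capacity-planning tooling):

# Back-of-the-envelope worker sizing from the figures above
posts_per_sec = 667            # sustained ingest rate
per_post_seconds = 0.060       # end-to-end processing time per post

work_per_second = posts_per_sec * per_post_seconds   # ~40 s of work per wall-clock second
per_worker_rate = 1 / per_post_seconds               # ~16.7 posts/s if a worker runs serially
min_workers = posts_per_sec / per_worker_rate        # ~40 workers just for steady state

print(f"{work_per_second:.0f}s of processing per wall-clock second "
      f"-> at least {min_workers:.0f} workers before any headroom")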
Infrastructure Overview
Scrapers (47 workers)
↓
Kafka (24 partitions, 7-day retention)
↓
NLP Workers (80 GPU instances)
↓
TimescaleDB (primary storage)
↓
Redis (caching + real-time queries)
↓
API Layer (read-only, public facing)

Everything runs on AWS. Total infrastructure cost: $31K/month.
Data Ingestion
We scrape 47 platforms. Mix of official APIs and browser automation.
Twitter: Rate limited to 5,000 posts per 15 minutes per app. We run 40 apps (40 authenticated accounts) in rotation. Each scraper polls every 30 seconds. Gives us ~800K tweets/hour capacity. In practice, we filter to election-relevant keywords and get ~300K/hour.
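The rotation itself is simple round-robin scheduling. A minimal sketch, assuming a fetch_tweets(app, keywords) helper that wraps whatever API client is in use and a publish() helper that pushes into Kafka; the app names and keyword list are placeholders, not our actual scraper code:

import itertools
import time

APPS = [f"app_{i}" for i in range(40)]              # 40 sets of credentials (placeholder names)
POLL_INTERVAL = 30                                  # each app is polled every 30 seconds
KEYWORDS = ["election", "ballot", "polling place"]  # illustrative filter terms

def run_twitter_scrapers(fetch_tweets, publish):
    """Round-robin the 40 apps so each one is polled roughly every 30 seconds."""
    for app in itertools.cycle(APPS):
        posts = fetch_tweets(app, KEYWORDS)    # each app stays under its 5,000 posts / 15 min cap
        publish("social_media_posts", posts)   # straight into Kafka
        time.sleep(POLL_INTERVAL / len(APPS))  # 40 apps x 0.75s = one full rotation every 30s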
Facebook: Official API is worthless (only public pages, heavily rate-limited). We use Playwright to drive headless Chrome. Each scraper logs in as a real user and scrolls feeds. Rotating proxies to avoid bans. Hacky but works. ~150K posts/hour.
Reddit: Excellent API. Using Pushshift dumps for historical data. Real-time via streaming API. ~200K posts/hour.
TikTok: Nightmare to scrape. No official API. Their anti-bot measures are sophisticated. We use residential proxies + device fingerprinting to look like mobile apps. Success rate ~60%. ~100K posts/hour.
Remaining platforms (Instagram, YouTube, Telegram, Discord, forums, etc.): ~1.8M posts/hour combined.
Kafka Configuration
All scrapers push to Kafka. This decouples ingestion from processing—critical for handling spikes.
# Kafka topic configuration
topic: social_media_posts
partitions: 24
replication_factor: 3
retention: 7 days (168 hours)
compression: lz4

# Producer settings (scrapers)
acks: 1               # Leader acknowledges; don't wait for replicas
compression_type: lz4
batch_size: 100KB
linger_ms: 50         # Wait up to 50ms to batch messages

# Consumer settings (NLP workers)
group_id: nlp_workers
auto_offset_reset: earliest
max_poll_records: 500
24 partitions works out to 8 per broker across our 3 brokers (one broker per availability zone), with room to scale the consumer group. Replication factor of 3 means we can lose 2 brokers without data loss.
7-day retention is crucial. When we update NLP models, we replay the last 7 days through the new pipeline. Instant reprocessing of recent data.
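Replaying amounts to pointing a fresh consumer group at week-old offsets. One way that could look with kafka-python; the topic and group names follow the config above, but the helper itself is an illustration, not our actual replay tooling:

import time
from kafka import KafkaConsumer, TopicPartition

TOPIC = "social_media_posts"

def replay_last_days(days=7, bootstrap="kafka:9092"):
    """Start a new consumer group positioned at the offsets closest to `days` ago."""
    consumer = KafkaConsumer(bootstrap_servers=bootstrap,
                             group_id=f"nlp_workers_replay_{int(time.time())}",
                             enable_auto_commit=False)
    partitions = [TopicPartition(TOPIC, p) for p in consumer.partitions_for_topic(TOPIC)]
    consumer.assign(partitions)

    # Map each partition to the first offset at or after the cutoff timestamp (ms)
    cutoff_ms = int((time.time() - days * 86400) * 1000)
    offsets = consumer.offsets_for_times({tp: cutoff_ms for tp in partitions})
    for tp, ot in offsets.items():
        if ot is None:
            consumer.seek_to_end(tp)     # nothing newer than the cutoff on this partition
        else:
            consumer.seek(tp, ot.offset)

    return consumer  # iterate over it to re-feed posts through the new models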
NLP Worker Design
80 workers, each running on g4dn.xlarge (4 vCPU, 16GB RAM, 1 NVIDIA T4 GPU).
Each worker:
import time

import fasttext
import spacy
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

class NLPWorker:
    def __init__(self):
        # Load models once at startup
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

        # Sentiment model (DistilBERT) and its tokenizer
        self.tokenizer = AutoTokenizer.from_pretrained("distilbert-election-sentiment")
        self.sentiment_model = AutoModelForSequenceClassification.from_pretrained(
            "distilbert-election-sentiment"
        ).to(self.device)

        # NER model (spaCy)
        self.ner = spacy.load("en_election_ner")

        # Language detector (fastText)
        self.lang_detector = fasttext.load_model("lid.176.bin")

    def process_batch(self, posts):
        """Process a batch of posts."""
        results = []
        for post in posts:
            # 1. Detect language (~3ms). fastText returns labels like
            #    '__label__en', so strip the prefix before comparing.
            label = self.lang_detector.predict(post.text.replace("\n", " "))[0][0]
            lang = label.replace("__label__", "")
            if lang not in ('en', 'es', 'zh'):
                continue  # Skip non-target languages

            # 2. Extract entities (~12ms)
            doc = self.ner(post.text)
            entities = {
                'people': [e.text for e in doc.ents if e.label_ == 'PERSON'],
                'orgs': [e.text for e in doc.ents if e.label_ == 'ORG'],
                'locations': [e.text for e in doc.ents if e.label_ == 'GPE'],
            }

            # 3. Sentiment classification (~45ms)
            inputs = self.tokenizer(post.text, return_tensors="pt",
                                    truncation=True, max_length=512)
            inputs = {k: v.to(self.device) for k, v in inputs.items()}
            with torch.no_grad():
                outputs = self.sentiment_model(**inputs)
            probs = torch.softmax(outputs.logits, dim=1)[0]
            sentiment = {
                'positive': probs[0].item(),
                'neutral': probs[1].item(),
                'negative': probs[2].item(),
            }

            results.append({
                'post_id': post.id,
                'language': lang,
                'entities': entities,
                'sentiment': sentiment,
                'processed_at': time.time(),
            })
        return results

Batch size is 32 posts. At ~60ms per post, each GPU worker handles ~16 posts/second. 80 workers × 16 posts/sec ≈ 1,280 posts/sec theoretical max. We run at 667 posts/sec, about 52% of capacity.
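The loop above still runs the sentiment model one text at a time; the batching win comes from tokenizing and classifying all 32 texts in a single forward pass. A minimal sketch of that batched step, reusing the tokenizer and model loaded in NLPWorker (a sketch, not the exact production code):

import torch

def classify_sentiment_batch(texts, tokenizer, model, device):
    """Classify a whole batch of texts in one forward pass through the GPU."""
    inputs = tokenizer(texts, return_tensors="pt", padding=True,
                       truncation=True, max_length=512)
    inputs = {k: v.to(device) for k, v in inputs.items()}
    with torch.no_grad():
        logits = model(**inputs).logits        # shape: (batch_size, 3)
    probs = torch.softmax(logits, dim=1)
    return [
        {'positive': p[0].item(), 'neutral': p[1].item(), 'negative': p[2].item()}
        for p in probs
    ]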
Storage Strategy
TimescaleDB for everything. It's PostgreSQL with time-series optimizations.
-- Requires the timescaledb extension (hypertables) and pgvector (vector column / ivfflat index)
CREATE EXTENSION IF NOT EXISTS timescaledb;
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE posts (
time TIMESTAMPTZ NOT NULL,
platform TEXT NOT NULL,
post_id TEXT NOT NULL,
author_id TEXT,
text TEXT,
language TEXT,
sentiment_pos FLOAT,
sentiment_neg FLOAT,
sentiment_neu FLOAT,
entities JSONB,
embedding vector(768), -- Sentence embedding for similarity search
PRIMARY KEY (time, post_id)
);
-- Create hypertable (automatic time-based partitioning)
SELECT create_hypertable('posts', 'time', chunk_time_interval => INTERVAL '1 day');
-- Indexes
CREATE INDEX idx_platform ON posts (platform, time DESC);
CREATE INDEX idx_author ON posts (author_id, time DESC);
CREATE INDEX idx_entities ON posts USING GIN (entities);
CREATE INDEX idx_embedding ON posts USING ivfflat (embedding vector_cosine_ops);
-- Compression (automatically compress data older than 7 days)
ALTER TABLE posts SET (
timescaledb.compress,
timescaledb.compress_segmentby = 'platform',
timescaledb.compress_orderby = 'time DESC'
);
SELECT add_compression_policy('posts', INTERVAL '7 days');

Current database size: 47TB. Query performance is excellent thanks to automatic partitioning. Historical queries ("show all posts mentioning 'candidate X' in October 2023") run in ~200ms.
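As an illustration of that kind of historical query, here is how it could be issued from Python with psycopg2; the JSONB containment operator (@>) lets PostgreSQL use the GIN index on entities. The connection string and candidate name are placeholders:

import json
import psycopg2

# Every post that mentions a given person in October 2023
query = """
    SELECT time, platform, post_id, text
    FROM posts
    WHERE time >= '2023-10-01' AND time < '2023-11-01'
      AND entities @> %s::jsonb
    ORDER BY time DESC
    LIMIT 100;
"""

conn = psycopg2.connect("dbname=osint")  # placeholder connection string
with conn, conn.cursor() as cur:
    cur.execute(query, (json.dumps({"people": ["Candidate X"]}),))
    rows = cur.fetchall()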
Real-Time Queries
PostgreSQL is great for batch analytics but too slow for real-time dashboards. We use Redis for hot data (last 24 hours).
import json

# NLP worker writes to both PostgreSQL and Redis
def store_results(results):
    # Write to PostgreSQL (durable storage)
    db.execute("INSERT INTO posts VALUES (...)", results)

    # Write to Redis (fast queries)
    for result in results:
        post_key = f"post:{result['post_id']}"

        # Sorted set of post IDs per platform, scored by timestamp
        redis.zadd(f"posts:{result['platform']}", {result['post_id']: result['time']})

        # Hash with post details (nested dicts/lists serialized to JSON)
        redis.hset(post_key, mapping={
            k: json.dumps(v) if isinstance(v, (dict, list)) else v
            for k, v in result.items()
        })

        # Expire after 24 hours
        redis.expire(post_key, 86400)

        # Update real-time counters
        redis.hincrby("stats:posts_by_platform", result['platform'], 1)
        redis.hincrby("stats:posts_by_sentiment", result['sentiment_label'], 1)

Dashboard queries hit Redis first and fall back to PostgreSQL for historical data.
Duplicate Detection
Posts get reposted, retweeted, cross-posted. We deduplicate using MinHash LSH.
from datasketch import MinHash, MinHashLSH

# Initialize LSH index: 128-permutation signatures, 0.85 similarity threshold
lsh = MinHashLSH(threshold=0.85, num_perm=128)

# post_id -> original text, kept so matches can report an exact similarity score
_post_texts = {}

def get_text(post_id):
    return _post_texts[post_id]

def calculate_jaccard(text_a, text_b):
    """Exact word-level Jaccard similarity (the quantity MinHash approximates)."""
    a, b = set(text_a.lower().split()), set(text_b.lower().split())
    return len(a & b) / len(a | b)

def check_duplicate(post):
    # Create MinHash signature over the post's words
    m = MinHash(num_perm=128)
    for word in post.text.lower().split():
        m.update(word.encode('utf8'))

    # Check for similar posts already in the index
    similar = lsh.query(m)
    if similar:
        # Duplicate found, store a reference to the original only
        return {
            'is_duplicate': True,
            'original_id': similar[0],
            'similarity': calculate_jaccard(post.text, get_text(similar[0])),
        }

    # New post: add it to the index
    lsh.insert(post.id, m)
    _post_texts[post.id] = post.text
    return {'is_duplicate': False}

# MinHash LSH lookups are effectively O(1), so it keeps up with 2.4M posts/hour

A threshold of 0.85 means two posts need roughly 85%+ estimated Jaccard similarity to be considered duplicates. That catches exact reposts and minor variations ("RT: ..." prefixes, etc.).
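A quick look at the behavior on a retweet-style variant, using throwaway stand-ins for post objects (the texts and IDs are made up):

from types import SimpleNamespace

original = SimpleNamespace(id="t1", text="Polling stations open at 7am across the state")
retweet = SimpleNamespace(id="t2", text="RT: Polling stations open at 7am across the state")

print(check_duplicate(original))  # {'is_duplicate': False}; first sighting, added to the index
print(check_duplicate(retweet))   # flagged as a duplicate of 't1' when the estimate clears 0.85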
Monitoring
We track everything:
- Ingestion rate (posts/sec per platform)
- Processing latency (p50, p95, p99)
- Error rates (failed parses, model errors)
- Queue depth (Kafka lag)
- Resource usage (CPU, GPU, memory per worker)
- Database performance (query times, index sizes)
Metrics go to Prometheus, visualized in Grafana. Alerts via PagerDuty if anything breaks.
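Each worker exposes its metrics over HTTP for Prometheus to scrape. A minimal sketch with prometheus_client; the metric names mirror the example output below, while handle(), worker.classify(), and the port are placeholders rather than our exact instrumentation (the p50/p95/p99 shown below would come from histogram buckets via histogram_quantile):

from prometheus_client import Counter, Histogram, start_http_server

POSTS_PROCESSED = Counter("posts_processed_total", "Posts fully processed", ["platform"])
PROCESSING_TIME = Histogram("processing_duration_seconds", "End-to-end per-post latency")
INFERENCE_TIME = Histogram("model_inference_duration_seconds", "Per-model inference time", ["model"])

start_http_server(9100)  # Prometheus scrapes http://<worker>:9100/metrics

@PROCESSING_TIME.time()
def handle(post, worker):
    with INFERENCE_TIME.labels(model="ner").time():
        doc = worker.ner(post.text)
    with INFERENCE_TIME.labels(model="sentiment").time():
        sentiment = worker.classify(post.text)   # stand-in for the sentiment step above
    POSTS_PROCESSED.labels(platform=post.platform).inc()
    return doc, sentiment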
# Example metrics exposed by NLP workers
posts_processed_total{platform="twitter"} 1847293
posts_processed_total{platform="reddit"} 892847
processing_duration_seconds{quantile="0.5"} 0.058
processing_duration_seconds{quantile="0.95"} 0.143
processing_duration_seconds{quantile="0.99"} 0.287
model_inference_duration_seconds{model="sentiment"} 0.045
model_inference_duration_seconds{model="ner"} 0.012
kafka_consumer_lag{partition="0"} 142
kafka_consumer_lag{partition="1"} 89Cost Breakdown
Monthly AWS bill: $31,247
- 80× g4dn.xlarge (NLP workers): $20,960 (67% of total)
- 3× r5.2xlarge (Kafka brokers): $3,240
- 1× r5.4xlarge (TimescaleDB): $2,419
- 50TB EBS storage: $5,000
- Data transfer: $1,200
- Misc (NAT gateways, load balancers): $428
At ~1.73B posts per month, that's ~$0.000018 per post. Could optimize by using Spot instances (save ~70% on GPU costs) but we value stability over savings.
Lessons Learned
1. Kafka is worth the complexity
Decoupling ingestion from processing saved us multiple times. During peak events, Kafka absorbed spikes while workers caught up. Without it, we'd lose data or need massive overprovisioning.
2. Batch processing is faster than you think
We started with real-time processing (one post at a time). Switching to batches (32 posts) improved throughput 3x due to GPU parallelization.
3. Monitoring prevents outages
We've had zero unplanned downtime in 8 months because we catch issues before they cascade. High Kafka lag? Add workers. Slow queries? Add indexes. Problems are obvious if you measure.
4. Time-series databases are underrated
TimescaleDB turned PostgreSQL into a time-series powerhouse. Automatic partitioning, compression, and retention policies handle data growth without manual intervention.
Infrastructure questions: contact@ai-analytics.org