
NLP Pipeline for Real-Time Sentiment Analysis at Scale

11 min read · AI Analytics
NLP · TensorFlow · Infrastructure · OSINT

Our OSINT platform processes 2.4 million social media posts per hour across 47 platforms. This isn't a demo project—it runs 24/7 monitoring election-related discourse. Here's how we built it.

Requirements

The system needs to:

  • Ingest posts from Twitter, Facebook, Instagram, TikTok, Reddit, etc.
  • Extract entities (people, organizations, locations)
  • Classify sentiment (positive/negative/neutral + intensity)
  • Detect coordinated campaigns (bot networks, astroturfing)
  • Store results for historical analysis
  • Alert on anomalies in real-time

All with <2 second latency from post to processed result. At 2.4M posts/hour, that's 667 posts per second.

Architecture

┌─────────────┐
│  Scrapers   │ (47 platform connectors)
└──────┬──────┘
       │
       ▼
┌─────────────┐
│ Kafka Queue │ (24 partitions)
└──────┬──────┘
       │
       ▼
┌─────────────┐
│ NLP Workers │ (80 parallel instances)
└──────┬──────┘
       │
       ▼
┌─────────────┐
│ TimescaleDB │ (time-series storage)
└─────────────┘

Data Ingestion

Each platform has a custom scraper. Some use official APIs, others scrape HTML. Twitter's API is rate-limited (5,000 posts/15min), so we run 40 authenticated accounts in rotation. Facebook requires browser automation (Playwright) because their API is deliberately crippled.

All scrapers push to Apache Kafka. This decouples ingestion from processing—if NLP workers go down, we don't lose data. Kafka retains 7 days of posts for replay during model updates.
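
For illustration, here is a minimal sketch of what a scraper's hand-off to Kafka might look like (the topic name and the kafka-python client are assumptions, not the production code):

import json
from kafka import KafkaProducer

# One producer per scraper process; keying by platform keeps each platform's
# posts ordered within a partition.
producer = KafkaProducer(
    bootstrap_servers=["kafka-1:9092", "kafka-2:9092", "kafka-3:9092"],
    key_serializer=lambda k: k.encode("utf-8"),
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    acks="all",  # wait for replication so a broker failure doesn't drop posts
)

def publish_post(platform: str, post: dict) -> None:
    """Push one scraped post onto the ingestion topic (topic name is hypothetical)."""
    producer.send("raw-posts", key=platform, value=post)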

NLP Models

We run three models in sequence:

1. Language Detection
FastText model trained on 176 languages. Takes 3ms per post. We only process English, Spanish, and Chinese; everything else is stored but not analyzed (we can add coverage later if needed).

2. Named Entity Recognition
Custom SpaCy model trained on political text. Extracts people, organizations, and locations. Fine-tuned on 2.3M labeled examples from previous election cycles. Takes 12ms per post.

3. Sentiment Classification
DistilBERT fine-tuned on 5M political tweets. Three-class output (positive/negative/neutral) plus confidence score. Takes 45ms per post on GPU.

Total processing time: ~60ms per post. We run 80 workers (each with 1 GPU), giving us a theoretical throughput of about 1,333 posts/second (80 workers ÷ 0.06s per post). In practice we sustain 2.4M posts/hour = 667 posts/second, so we're at roughly 50% capacity, with headroom for traffic spikes.
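
A condensed sketch of how the three stages chain together inside a worker. The libraries (fasttext, spaCy, Hugging Face transformers) match the models described above, but the model paths and names are placeholders, not our production artifacts:

import fasttext
import spacy
from transformers import pipeline

# Stage 1: language ID (fastText's 176-language model)
lang_model = fasttext.load_model("lid.176.bin")
# Stage 2: NER model fine-tuned on political text (path is hypothetical)
ner = spacy.load("en_political_ner")
# Stage 3: DistilBERT sentiment classifier (model name is hypothetical)
sentiment = pipeline("text-classification", model="distilbert-political-sentiment", device=0)

SUPPORTED = {"__label__en", "__label__es", "__label__zh"}

def process(post_text: str):
    """Run one post through language ID, NER, and sentiment classification."""
    labels, _ = lang_model.predict(post_text.replace("\n", " "))
    if labels[0] not in SUPPORTED:
        return None  # stored but not analyzed
    doc = ner(post_text)
    entities = [{"text": ent.text, "label": ent.label_} for ent in doc.ents]
    pred = sentiment(post_text, truncation=True)[0]
    return {
        "language": labels[0].replace("__label__", ""),
        "entities": entities,
        "sentiment": pred["label"],
        "confidence": pred["score"],
    }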

Coordinated Campaign Detection

This is the interesting part. Sentiment analysis is table stakes—detecting coordinated manipulation is where the value is.

We look for:

  • Temporal clustering: 100+ accounts posting similar content within 60 seconds (a sketch of this check follows the list)
  • Linguistic similarity: Posts with >85% text overlap (accounting for templating)
  • Account age patterns: Networks of accounts created within days of each other
  • Behavioral anomalies: Posting at inhuman speeds, identical typos, shared URLs
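
A minimal sketch of the temporal-clustering check from the first bullet. How posts are keyed to "similar content" is an assumption here; in practice the key could be the MinHash bucket described below:

from collections import defaultdict, deque

WINDOW_SECONDS = 60
ACCOUNT_THRESHOLD = 100

# content_key -> (timestamp, author_id) pairs seen within the last window
recent = defaultdict(deque)

def is_temporal_cluster(content_key: str, timestamp: float, author_id: str) -> bool:
    """True once 100+ distinct accounts post matching content within 60 seconds."""
    window = recent[content_key]
    window.append((timestamp, author_id))
    # Evict entries older than the 60-second window
    while window and timestamp - window[0][0] > WINDOW_SECONDS:
        window.popleft()
    distinct_authors = {author for _, author in window}
    return len(distinct_authors) >= ACCOUNT_THRESHOLD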

We use MinHash LSH (locality-sensitive hashing) for fast similarity detection. Each post gets a MinHash signature built from 128 permutations; LSH bands those signatures so near-duplicate posts collide in the same buckets. This gives us a near-constant-time lookup per post instead of comparing every post to every other post.

from datasketch import MinHash, MinHashLSH

lsh = MinHashLSH(threshold=0.85, num_perm=128)

for post in posts:
    m = MinHash(num_perm=128)
    for word in post.text.split():
        m.update(word.encode('utf8'))

    # Look up previously indexed posts whose estimated similarity exceeds 0.85
    result = lsh.query(m)
    if len(result) > 10:  # Cluster detected
        flag_coordinated_campaign(post, result)

    # Index this post so later posts can match against it
    lsh.insert(post.post_id, m)

Storage

Raw posts go into TimescaleDB (PostgreSQL extension for time-series data). We use hypertables with 1-day chunks. Old data automatically compresses.

Schema:

CREATE TABLE posts (
  time        TIMESTAMPTZ NOT NULL,
  platform    TEXT NOT NULL,
  post_id     TEXT NOT NULL,
  author_id   TEXT NOT NULL,
  text        TEXT,
  language    TEXT,
  sentiment   FLOAT,
  entities    JSONB,
  campaign_id UUID,  -- NULL if not part of campaign

  PRIMARY KEY (time, post_id)
);

SELECT create_hypertable('posts', 'time', chunk_time_interval => INTERVAL '1 day');
CREATE INDEX idx_campaign ON posts (campaign_id);
CREATE INDEX idx_entities ON posts USING GIN (entities);

Current storage: 47TB (14 months of data). Query performance is excellent thanks to time-based partitioning. Historical queries ("show me all negative posts about candidate X in October 2023") run in <200ms.
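
For illustration, that historical query expressed through psycopg2 (the connection string, the -0.5 "negative" cutoff, and the JSONB shape of entities are assumptions):

import json
import psycopg2

conn = psycopg2.connect("dbname=osint")  # connection details are illustrative

def negative_posts_about(entity: str, start: str, end: str):
    """All negative posts mentioning an entity within a time range."""
    with conn.cursor() as cur:
        cur.execute(
            """
            SELECT time, platform, post_id, text, sentiment
            FROM posts
            WHERE time >= %s AND time < %s
              AND sentiment < -0.5               -- "negative" cutoff is an assumption
              AND entities @> %s::jsonb          -- served by the GIN index on entities
            ORDER BY time
            """,
            (start, end, json.dumps([{"text": entity}])),
        )
        return cur.fetchall()

# e.g. negative_posts_about("Candidate X", "2023-10-01", "2023-11-01")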

Real-Time Alerting

We use Redis Streams for real-time alerts. When the system detects:

  • Sudden sentiment shift (>20% change in 1 hour)
  • Coordinated campaign (>50 accounts)
  • Viral spread (post shared >10K times in <30 min)
  • New entity mentioned >500 times (breaking news)

...it pushes an alert to Redis. The monitoring dashboard polls Redis every 5 seconds, and critical alerts also trigger webhooks (Slack, email, SMS).
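
A minimal sketch of the alert hand-off over Redis Streams (the stream name, field layout, and redis-py client are assumptions):

import json
import redis

r = redis.Redis(host="redis-alerts", port=6379)  # host is illustrative

def push_alert(alert_type: str, payload: dict) -> None:
    """Append one alert to the stream; the dashboard reads it on its 5-second poll."""
    r.xadd(
        "alerts",
        {"type": alert_type, "payload": json.dumps(payload)},
        maxlen=100_000,      # keep the stream from growing unbounded
        approximate=True,
    )

# Illustrative values only
push_alert("coordinated_campaign", {"accounts": 63, "platform": "twitter"})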

Performance Numbers

Production metrics (30-day average):

  • Throughput: 2.4M posts/hour sustained
  • Latency: 1.8 second median (ingestion to storage)
  • Accuracy: 94.7% sentiment classification (vs human labelers)
  • Campaign detection: 89% precision, 76% recall
  • Uptime: 99.8% (3 unplanned outages, all <20 min)

Costs

Infrastructure runs on AWS:

  • 80x g4dn.xlarge instances (NLP workers): $21K/month
  • Kafka cluster (3x r5.2xlarge): $3.2K/month
  • TimescaleDB (1x r5.4xlarge + 50TB storage): $4.8K/month
  • Networking/bandwidth: $2K/month

Total: ~$31K/month. At 2.4M posts/hour (roughly 1.7 billion posts/month), that works out to about $0.02 per thousand posts processed. We could optimize further, but performance matters more than cost for this use case.

Lessons Learned

1. Don't over-engineer the models
We started with BERT-large. Switched to DistilBERT and got 90% of the accuracy at 3x the speed. For production systems, fast-and-good beats slow-and-perfect.

2. The pipeline is the product
Spend time on data ingestion, storage, and monitoring. The NLP is almost commodity at this point. The hard part is running it reliably at scale.

3. Platform APIs are adversarial
Twitter, Facebook, etc. actively work to limit data access. Build scrapers that look like real users. Rotate IPs. Don't trust APIs to stay stable.

4. Design for replay
Kafka retention saved us multiple times. When we updated models, we replayed 7 days of posts through the new pipeline. Instant historical correction.
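
A sketch of how a replay can be kicked off by rewinding consumer offsets to a timestamp (kafka-python; the topic and group names are assumptions, and the 24 matches the partition count in the diagram above):

from datetime import datetime, timedelta, timezone
from kafka import KafkaConsumer, TopicPartition

consumer = KafkaConsumer(
    bootstrap_servers=["kafka-1:9092"],
    group_id="nlp-workers-replay",   # separate group so live workers keep their offsets
    enable_auto_commit=False,
)

partitions = [TopicPartition("raw-posts", p) for p in range(24)]
consumer.assign(partitions)

# Find the earliest offset at or after the timestamp 7 days ago, per partition
seven_days_ago_ms = int((datetime.now(timezone.utc) - timedelta(days=7)).timestamp() * 1000)
offsets = consumer.offsets_for_times({tp: seven_days_ago_ms for tp in partitions})
for tp, ot in offsets.items():
    if ot is not None:
        consumer.seek(tp, ot.offset)

# Consume as normal from here: the updated models re-process the last 7 days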

Full pipeline architecture available on request: contact@ai-analytics.org