
Building a Digital-Footprint Reconnaissance Pipeline for OSINT Investigations

October 28, 2024 · 16 min read · AI Analytics

OSINT · Reconnaissance · Entity resolution · Infrastructure

Several of Voidly's research use cases require building a persistent, cross-platform profile of an entity — a government agency, an ISP, an individual operator, or an organization suspected of coordinating censorship. This is not the same as a one-off search: we need a system that continuously monitors an entity's digital footprint across 40+ sources, resolves identity across platforms where the entity may use different names, and surfaces changes without requiring manual re-investigation.

This article documents the architecture of that internal pipeline. We call it the digital-footprint reconnaissance tool. It is not separately released — the specific targeting is too operationally sensitive — but the techniques are standard OSINT practice and documenting them is useful for researchers building similar systems.

What "digital footprint" means in practice

An entity's digital footprint is the aggregate of all observable data points that can be linked to it across public sources:

Source category     Examples                                    Coverage
───────────────────────────────────────────────────────────────────────────────────
Social media        Twitter/X, LinkedIn, GitHub, Mastodon       Name / handle
Public registries   WHOIS, BGP, ARIN, RIPE, APNIC               Network identifiers
Corporate records   SEC EDGAR, Companies House, OpenCorporates  Legal entities
Academic            Google Scholar, ORCID, ResearchGate         Publications
Government          USAspending, FEC, SAM.gov, regulations.gov  Affiliations
Communication       Email headers, PGP keyservers               Cryptographic keys
Technical           Shodan, Censys, Certificate Transparency    Infra fingerprints
News / web          Common Crawl, news archives, Wayback        Historical mentions

The challenge is that the same entity may appear under different names across sources, may deliberately use pseudonyms, and may share identifiers (like IP addresses) with unrelated entities. A naive union of all records mentioning a name produces an unusable quantity of false positives.

Entity model

The central object is an Entity with a canonical ID and a set of verified Attributes. Each attribute has a source, a confidence score, and a timestamp. The entity model is persistent: attributes accumulate over time and are never deleted (only timestamped as stale if superseded).

struct Entity {
    id: Uuid,
    canonical_name: String,
    entity_type: EntityType,   // Person | Organization | Network | ...
    attributes: Vec<Attribute>,
    aliases: Vec<String>,
    related_entities: Vec<EntityRelation>,
    confidence: f32,           // 0.0–1.0, updated on each new evidence
    created_at: DateTime,
    last_updated: DateTime,
}

struct Attribute {
    key: AttributeKey,         // Email | Handle | CIDR | AutonomousSystem | ...
    value: String,
    source: DataSource,
    source_url: String,
    retrieved_at: DateTime,
    confidence: f32,
    stale_after: Option<DateTime>,
}

enum AttributeKey {
    Email,
    SocialHandle { platform: Platform },
    PhoneE164,
    CidrBlock,
    AutonomousSystem,
    PgpFingerprint,
    DomainName,
    CompanyRegistrationNumber { jurisdiction: String },
    SecCik,
    FecCommitteeId,
    OrcidId,
    // ...
}
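
The append-only rule described above (attributes are marked stale when superseded, never deleted) can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline's code: `SimpleAttribute`, `AttributeLog`, `record`, and `current` are hypothetical names, timestamps are plain Unix seconds instead of `DateTime`, keys are strings instead of `AttributeKey`, and every key is treated as single-valued, which does not hold for all attribute types (an entity can hold several emails at once).

```rust
/// Simplified stand-in for the article's `Attribute` (hypothetical).
#[derive(Debug, Clone, PartialEq)]
pub struct SimpleAttribute {
    pub key: String,
    pub value: String,
    pub retrieved_at: u64,        // Unix seconds
    pub stale_after: Option<u64>, // None = still current
}

/// Append-only attribute log: records are superseded, never removed.
#[derive(Debug, Default)]
pub struct AttributeLog {
    pub attributes: Vec<SimpleAttribute>,
}

impl AttributeLog {
    /// Record a new observation. Any current attribute with the same key but
    /// a different value is marked stale as of `now`; nothing is deleted.
    pub fn record(&mut self, key: &str, value: &str, now: u64) {
        let mut already_current = false;
        for attr in self.attributes.iter_mut().filter(|a| a.key == key) {
            if attr.stale_after.is_some() {
                continue; // superseded by an earlier observation
            }
            if attr.value == value {
                already_current = true; // same fact observed again
            } else {
                attr.stale_after = Some(now);
            }
        }
        if !already_current {
            self.attributes.push(SimpleAttribute {
                key: key.to_string(),
                value: value.to_string(),
                retrieved_at: now,
                stale_after: None,
            });
        }
    }

    /// Attributes that have not been superseded.
    pub fn current(&self) -> Vec<&SimpleAttribute> {
        self.attributes
            .iter()
            .filter(|a| a.stale_after.is_none())
            .collect()
    }
}
```

Keeping superseded values is what makes historical questions answerable later, for example which registrant email a domain carried at the time of a given event.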

Collection layer

Collection is passive — we do not interact with or probe target infrastructure in ways that might alert the subject. Every data point comes from public sources that we have a legitimate right to access. The pipeline has 40+ source connectors:

// Source connector interface
// (async fn in traits needs Rust 1.75+ or the async-trait crate)
trait SourceConnector: Send + Sync {
    fn name(&self) -> &str;
    fn rate_limit(&self) -> RateLimit;

    async fn search(
        &self,
        query: &EntityQuery,
        client: &RateLimitedClient,
    ) -> Result<Vec<RawRecord>>;

    async fn enrich(
        &self,
        entity: &Entity,
        client: &RateLimitedClient,
    ) -> Result<Vec<Attribute>>;
}

// Example: WHOIS connector (search() omitted for brevity)
impl SourceConnector for WhoisConnector {
    fn name(&self) -> &str { "whois" }

    fn rate_limit(&self) -> RateLimit {
        RateLimit { requests_per_minute: 30, burst: 5 }
    }

    async fn enrich(&self, entity: &Entity, client: &RateLimitedClient)
    -> Result<Vec<Attribute>> {
        let domains = entity.attributes_of(AttributeKey::DomainName);
        let mut attrs = vec![];

        for domain in domains {
            let whois = client.whois(&domain.value).await?;
            if let Some(registrant_email) = whois.registrant_email {
                attrs.push(Attribute {
                    key: AttributeKey::Email,
                    value: registrant_email,
                    source: DataSource::Whois,
                    confidence: 0.7,   // Registrant data is often redacted or fake
                    ..Attribute::now()
                });
            }
        }
        Ok(attrs)
    }
}

Rate limiting is enforced per connector, not globally. Different sources have very different rate limits (Twitter: 15 req/15min per account; ARIN WHOIS: 60 req/min; Certificate Transparency logs: effectively unlimited). Each connector manages its own token bucket and backs off automatically on 429 responses.
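
A per-connector token bucket of the kind described above might look like the sketch below. It is illustrative only: `TokenBucket`, `try_acquire`, and `backoff_secs` are hypothetical names, the clock is passed in as seconds (an assumption made so the behaviour is easy to test), and the exponential backoff policy is one common choice rather than something the pipeline is known to use.

```rust
/// Token bucket with an explicit clock (seconds since an arbitrary start).
#[derive(Debug)]
pub struct TokenBucket {
    capacity: f64,       // maximum burst size
    tokens: f64,         // tokens currently available
    refill_per_sec: f64, // steady-state rate
    last_refill: f64,    // time of the last refill
}

impl TokenBucket {
    pub fn new(requests_per_minute: u32, burst: u32, now: f64) -> Self {
        TokenBucket {
            capacity: burst as f64,
            tokens: burst as f64,
            refill_per_sec: requests_per_minute as f64 / 60.0,
            last_refill: now,
        }
    }

    /// Add the tokens earned since the last call, capped at `capacity`.
    fn refill(&mut self, now: f64) {
        let elapsed = (now - self.last_refill).max(0.0);
        self.tokens = (self.tokens + elapsed * self.refill_per_sec).min(self.capacity);
        self.last_refill = now;
    }

    /// Try to spend one token. Returns false if the caller should wait.
    pub fn try_acquire(&mut self, now: f64) -> bool {
        self.refill(now);
        if self.tokens >= 1.0 {
            self.tokens -= 1.0;
            true
        } else {
            false
        }
    }
}

/// Exponential backoff delay in seconds after the n-th consecutive 429,
/// capped at `max_secs`.
pub fn backoff_secs(consecutive_429s: u32, base_secs: f64, max_secs: f64) -> f64 {
    (base_secs * 2f64.powi(consecutive_429s.min(30) as i32)).min(max_secs)
}
```

With the WHOIS connector's settings (30 requests per minute, burst of 5), the bucket allows five immediate requests and then one more every two seconds.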

For sources requiring authentication (LinkedIn, GitHub, some APIs), we maintain pools of authenticated accounts rotated across requests. Accounts are never used in ways that violate terms of service — we stay within API rate limits and do not use browser automation for sources that provide official APIs.

Identity disambiguation: the hard problem

The core challenge: given a set of records from different sources, determine which ones refer to the same entity. A naive approach (merge everything with the same name) produces catastrophic false positives — "John Smith" matches thousands of unrelated people. The correct approach models identity resolution as a graph problem.

The entity graph

Every observed identifier becomes a node. Observations linking two identifiers become weighted edges. An entity is a connected component of this graph, pruned by confidence thresholds.

// Evidence types and their edge weights
enum EvidenceType {
    SameProfile,          // Both identifiers on the same profile page   → 0.95
    ExplicitLink,         // Profile links to another profile            → 0.90
    SharedEmail,          // Two accounts verified to same email         → 0.88
    SharedPgpKey,         // Two accounts with same PGP fingerprint      → 0.92
    SharedIp,             // Two accounts logged in from same IP         → 0.40  (VPNs exist)
    SharedWritingStyle,   // Stylometric match p < 0.01                  → 0.55
    NameVariant,          // "John Smith" ~ "J. Smith" ~ "@jsmith"       → 0.20
    SharedOrg,            // Both affiliated with same org               → 0.15
}

// Entity component extraction
fn extract_entity_components(graph: &EntityGraph, threshold: f32) -> Vec<Entity> {
    // Remove edges below confidence threshold
    let pruned = graph.filter_edges(|e| e.weight >= threshold);

    // Connected components = candidate entities
    let components = pruned.connected_components();

    components.into_iter().map(|nodes| {
        Entity {
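            // Component confidence = the strongest single edge in the component
            // (0.0 for an isolated node with no surviving edges)
            confidence: nodes.iter()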
                .flat_map(|n| n.incoming_edges())
                .map(|e| e.weight)
                .fold(0.0_f32, f32::max),
            attributes: nodes.into_iter()
                .flat_map(|n| n.attributes)
                .collect(),
            // ...
        }
    }).collect()
}
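
For readers who want something self-contained, the prune-then-group step can be sketched with a union-find structure. The `Edge` type, `find`, and `components` below are hypothetical stand-ins for `EntityGraph` and its methods, whose internals are not shown here; identifiers are plain strings.

```rust
use std::collections::HashMap;

/// Weighted, undirected observation linking two identifiers (hypothetical type).
pub struct Edge<'a> {
    pub a: &'a str,
    pub b: &'a str,
    pub weight: f32,
}

/// Find the representative of `x`, compressing the path as we go.
fn find(parent: &mut [usize], x: usize) -> usize {
    let mut root = x;
    while parent[root] != root {
        root = parent[root];
    }
    let mut cur = x;
    while parent[cur] != root {
        let next = parent[cur];
        parent[cur] = root;
        cur = next;
    }
    root
}

/// Drop edges below `threshold`, then group identifiers into connected
/// components. Each component is sorted, and components are sorted by their
/// first member, so the output is deterministic.
pub fn components(nodes: &[&str], edges: &[Edge], threshold: f32) -> Vec<Vec<String>> {
    let index: HashMap<&str, usize> =
        nodes.iter().enumerate().map(|(i, n)| (*n, i)).collect();
    let mut parent: Vec<usize> = (0..nodes.len()).collect();

    for e in edges.iter().filter(|e| e.weight >= threshold) {
        if let (Some(&ia), Some(&ib)) = (index.get(e.a), index.get(e.b)) {
            let ra = find(&mut parent, ia);
            let rb = find(&mut parent, ib);
            if ra != rb {
                parent[ra] = rb; // merge the two components
            }
        }
    }

    let mut groups: HashMap<usize, Vec<String>> = HashMap::new();
    for (i, n) in nodes.iter().enumerate() {
        let root = find(&mut parent, i);
        groups.entry(root).or_default().push((*n).to_string());
    }
    let mut out: Vec<Vec<String>> = groups.into_values().collect();
    for g in out.iter_mut() {
        g.sort();
    }
    out.sort();
    out
}
```

In this sketch a name-variant edge (0.20) never survives a 0.5 threshold on its own, so "John Smith" stays separate from the handles unless stronger evidence links them.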

Confidence calibration

Edge weights are calibrated against a labeled test set of 2,000 entity pairs (500 confirmed same-entity, 1,500 confirmed different-entity) verified through ground truth from public records (e.g., a journalist who is publicly known to use specific social handles). The calibration is updated quarterly as new sources are added and existing source reliability changes.
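
One simple way to derive a weight from such a labeled set is the empirical precision of each evidence type: among labeled pairs where the evidence was observed, the fraction that really are the same entity. The sketch below assumes that definition; the article does not state the actual calibration procedure, and `LabeledPair` and `calibrated_weight` are hypothetical names.

```rust
/// One labeled pair from the calibration set (hypothetical shape).
pub struct LabeledPair {
    pub evidence_present: bool, // did this evidence type link the pair?
    pub same_entity: bool,      // ground-truth label
}

/// Empirical precision of an evidence type: among pairs where the evidence
/// was observed, the fraction that really are the same entity.
/// Returns None when the evidence never occurs in the labeled set.
pub fn calibrated_weight(pairs: &[LabeledPair]) -> Option<f32> {
    let with_evidence: Vec<&LabeledPair> =
        pairs.iter().filter(|p| p.evidence_present).collect();
    if with_evidence.is_empty() {
        return None;
    }
    let true_links = with_evidence.iter().filter(|p| p.same_entity).count();
    Some(true_links as f32 / with_evidence.len() as f32)
}
```

A precision computed this way depends on the ratio of same-entity to different-entity pairs in the labeled set, so the class balance of the test set matters as much as its size.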

Evidence types differ sharply in strength. A shared IP address is weak evidence of shared identity (VPNs and shared office networks produce false positives). A shared PGP key is strong (using a key requires the private key material, which rarely leaves its owner's devices). A name variant alone is very weak (common names collide), but a combination of name variant, shared organization, and shared time zone pattern is much stronger than any one of them.
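
How several weak signals add up can be illustrated with a noisy-OR combination, 1 − Π(1 − wᵢ). This is an assumption chosen for illustration: the article does not say which combination rule the pipeline uses, `noisy_or` is a hypothetical helper, and the formula treats the signals as independent, which real evidence often is not.

```rust
/// Combine evidence weights with a noisy-OR: 1 - product of (1 - w_i).
/// Weights are clamped to [0, 1]; an empty slice yields 0.0.
pub fn noisy_or(weights: &[f32]) -> f32 {
    let miss: f32 = weights
        .iter()
        .map(|w| 1.0 - w.clamp(0.0, 1.0))
        .product();
    1.0 - miss
}
```

Using the weights listed earlier, a name variant (0.20) plus a shared organization (0.15) combine to about 0.32, and adding a shared IP (0.40) lifts the total to roughly 0.59. Correlated signals would justify a smaller increase than this formula gives.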

Stylometric analysis

For entities that produce significant text (bloggers, researchers, operators running public communications channels), we apply stylometric fingerprinting. The features:

struct StyleProfile {
    // Lexical
    avg_sentence_length: f32,
    avg_word_length: f32,
    type_token_ratio: f32,          // vocabulary richness
    punctuation_freq: HashMap<char, f32>,

    // Syntactic (POS tag distribution)
    pos_distribution: HashMap<PosTag, f32>,
    avg_parse_tree_depth: f32,

    // Character-level
    char_bigram_distribution: HashMap<(char, char), f32>,

    // Temporal
    posting_hour_distribution: [f32; 24],  // when do they post?
    avg_reply_latency_minutes: f32,

    // Error patterns (distinctive typos, consistent misspellings)
    consistent_errors: Vec<String>,
}

fn stylometric_similarity(a: &StyleProfile, b: &StyleProfile) -> f32 {
    // Cosine similarity over a concatenated feature vector
    // Calibrated: p < 0.01 threshold → 0.55 edge weight
    let a_vec = a.to_feature_vector();
    let b_vec = b.to_feature_vector();
    cosine_similarity(&a_vec, &b_vec)
}

Stylometry is useful when a subject is suspected of operating multiple pseudonymous accounts. Its limitations: it requires a substantial text corpus (at least 500 words per identity); it is confounded by language effects (translated text, or text written in a second language, carries patterns that can swamp individual style); and deliberate obfuscation (intentionally varying one's writing style) can defeat it. We treat stylometric evidence as corroborating, never as the primary basis for entity linkage.
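
To make the comparison step concrete, here is a minimal sketch of `cosine_similarity` together with two of the lexical features from `StyleProfile` (average word length and type-token ratio). The tokenization in `lexical_features` is a deliberately naive assumption (split on whitespace, strip surrounding punctuation, lowercase), and the function name is hypothetical.

```rust
use std::collections::HashSet;

/// Cosine similarity of two equal-length vectors. Returns 0.0 if the lengths
/// differ or either vector is all zeros.
pub fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    if a.len() != b.len() {
        return 0.0;
    }
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        0.0
    } else {
        dot / (norm_a * norm_b)
    }
}

/// Returns [average word length, type-token ratio] for `text`.
/// Words are whitespace-separated, lowercased, with surrounding
/// punctuation stripped; empty input yields [0.0, 0.0].
pub fn lexical_features(text: &str) -> [f32; 2] {
    let words: Vec<String> = text
        .split_whitespace()
        .map(|w| w.trim_matches(|c: char| !c.is_alphanumeric()).to_lowercase())
        .filter(|w| !w.is_empty())
        .collect();
    if words.is_empty() {
        return [0.0, 0.0];
    }
    let total_chars: usize = words.iter().map(|w| w.chars().count()).sum();
    let distinct: HashSet<&str> = words.iter().map(|w| w.as_str()).collect();
    let n = words.len() as f32;
    [total_chars as f32 / n, distinct.len() as f32 / n]
}
```

Features on very different scales (word length in characters versus a ratio between 0 and 1) should be standardized before they are concatenated; otherwise the largest-magnitude feature dominates the cosine.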

Certificate Transparency as a passive source

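Certificate Transparency (CT) logs are one of the most underutilized OSINT sources. Major browsers require publicly trusted TLS certificates to be logged to multiple CT logs, so in practice almost every certificate issued by a public CA appears there. Each entry contains the domain name(s), the issuing CA, a timestamp, and, for organization-validated certificates, the subject organization. This means that a new domain an entity brings into use usually appears in CT logs within minutes of its first certificate being issued, often before the site is publicly reachable. A domain that is registered but never receives a certificate does not appear.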

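// Monitor CT logs for domains matching entity patterns.
// Not an async fn: it returns the stream itself rather than a future of one,
// and the stream borrows only from `certstream`, hence the explicit lifetime.
fn monitor_ct_logs<'a>(
    entity: &Entity,
    certstream: &'a mut CertStreamClient,
) -> impl Stream<Item = CertEntry> + 'a {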
    // Build pattern set from known entity domains
    let patterns: Vec<DomainPattern> = entity
        .attributes_of(AttributeKey::DomainName)
        .iter()
        .flat_map(|a| {
            let domain = a.value.as_str();
            vec![
                DomainPattern::ExactMatch(domain.to_string()),
                DomainPattern::SubdomainOf(domain.to_string()),
                // Typosquats: 1-edit-distance variants of known domains
                DomainPattern::EditDistance1(domain.to_string()),
            ]
        })
        .collect();

    certstream
        .stream()
        .filter(move |cert| patterns.iter().any(|p| p.matches(&cert.domain)))
}

// CertStream (certstream.calidog.io) provides real-time CT log aggregation
// We run a local consumer that fans out matches to entity monitors

In practice, CT log monitoring has caught infrastructure re-registration events that would have been missed by periodic polling. When an entity decommissions a domain and a new entity (potentially a successor or a related actor) registers a similar domain weeks later, the CT event surfaces immediately.
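
The pattern matching used by the monitor is not shown in this article, so the following is a sketch of how `DomainPattern::matches` could work, assuming case-insensitive comparison and plain Levenshtein distance for the typosquat check. `is_edit_distance_1` is a hypothetical helper.

```rust
/// True if `a` and `b` differ by exactly one insertion, deletion, or
/// substitution (Levenshtein distance 1). Compares Unicode scalar values.
pub fn is_edit_distance_1(a: &str, b: &str) -> bool {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    let (short, long) = if a.len() <= b.len() { (&a, &b) } else { (&b, &a) };
    if long.len() - short.len() > 1 {
        return false;
    }
    // Skip the common prefix.
    let mut i = 0;
    while i < short.len() && short[i] == long[i] {
        i += 1;
    }
    if short.len() == long.len() {
        // Substitution: one position differs and the rest must match.
        i < short.len() && short[i + 1..] == long[i + 1..]
    } else {
        // Insertion or deletion: skipping one char of `long` aligns the rest.
        short[i..] == long[i + 1..]
    }
}

pub enum DomainPattern {
    ExactMatch(String),
    SubdomainOf(String),
    EditDistance1(String),
}

impl DomainPattern {
    /// Case-insensitive match of `candidate` against this pattern.
    pub fn matches(&self, candidate: &str) -> bool {
        let candidate = candidate.to_ascii_lowercase();
        match self {
            DomainPattern::ExactMatch(d) => candidate == d.to_ascii_lowercase(),
            DomainPattern::SubdomainOf(d) => {
                let suffix = format!(".{}", d.to_ascii_lowercase());
                candidate.ends_with(suffix.as_str())
            }
            DomainPattern::EditDistance1(d) => {
                is_edit_distance_1(&candidate, &d.to_ascii_lowercase())
            }
        }
    }
}
```

Plain Levenshtein distance counts a transposition such as "exmaple.com" as two edits, so it is missed here; a Damerau-Levenshtein variant would catch it at the cost of more false positives on short domains.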

BGP and ASN tracking

For entities that operate network infrastructure (ISPs, CDN operators, government network agencies), we track their BGP presence. Public route collectors publish Routing Information Base (RIB) dumps on a fixed schedule; RIPE NCC's RIS, for example, publishes one every 8 hours. We diff consecutive RIB snapshots to detect:

New IP prefixes announced by a tracked AS
Prefix transfers between ASNs (may indicate infrastructure acquisition)
Route withdrawals (may indicate decommissioning or seizure)
Origin AS changes (may indicate routing hijack or traffic redirection)

// RIB diff against tracked ASNs (new announcements and withdrawals only)
async fn process_rib_diff(
    prev_rib: &Rib,
    curr_rib: &Rib,
    tracked_asns: &HashSet<u32>,
) -> Vec<BgpEvent> {
    let mut events = vec![];

    for prefix in curr_rib.prefixes_not_in(prev_rib) {
        if tracked_asns.contains(&prefix.origin_as) {
            events.push(BgpEvent::NewAnnouncement {
                asn: prefix.origin_as,
                prefix: prefix.network,
                first_seen: curr_rib.timestamp,
            });
        }
    }

    for prefix in prev_rib.prefixes_not_in(curr_rib) {
        if tracked_asns.contains(&prefix.origin_as) {
            events.push(BgpEvent::Withdrawal {
                asn: prefix.origin_as,
                prefix: prefix.network,
                last_seen: prev_rib.timestamp,
            });
        }
    }

    events
}

// Events feed back into the entity's attribute set
// and also into Voidly's BGP shutdown detection pipeline
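
The function above covers new announcements and withdrawals. Origin AS changes, the last item in the list, could be detected along the lines of the sketch below. It uses a simplified RIB (`SimpleRib`, a hypothetical prefix-to-origin map) rather than the `Rib` type above, and it ignores prefixes with multiple origins; the prefixes and ASNs in any example use documentation ranges.

```rust
use std::collections::{HashMap, HashSet};

/// Simplified RIB: prefix (as text) -> origin ASN. Hypothetical; real RIBs
/// can carry several origins per prefix, which this sketch ignores.
pub type SimpleRib = HashMap<String, u32>;

#[derive(Debug, PartialEq, Eq, PartialOrd, Ord)]
pub struct OriginChange {
    pub prefix: String,
    pub old_asn: u32,
    pub new_asn: u32,
}

/// Prefixes present in both snapshots whose origin AS differs, where either
/// the old or the new origin is a tracked ASN. Sorted for stable output.
pub fn origin_changes(
    prev: &SimpleRib,
    curr: &SimpleRib,
    tracked: &HashSet<u32>,
) -> Vec<OriginChange> {
    let mut out: Vec<OriginChange> = curr
        .iter()
        .filter_map(|(prefix, &new_asn)| {
            let &old_asn = prev.get(prefix)?; // prefix is new: not a change
            let involves_tracked =
                tracked.contains(&old_asn) || tracked.contains(&new_asn);
            if old_asn != new_asn && involves_tracked {
                Some(OriginChange { prefix: prefix.clone(), old_asn, new_asn })
            } else {
                None
            }
        })
        .collect();
    out.sort();
    out
}
```

An origin change on its own does not distinguish a hijack from a legitimate transfer; it is a trigger for follow-up against registry records, not a conclusion.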

Operational security for researchers

When the subject of reconnaissance is a state actor or a hostile organization, operational security for the researcher conducting the investigation is paramount. The pipeline is designed so that all collection happens from infrastructure that is not attributable to individual researchers:

All collection requests are routed through a pool of residential proxies and VPN exit nodes. No collection happens from researcher IP addresses directly.
Social media accounts used for collection are registered to non-attributable identities and aged (not freshly created accounts that would attract bot detection).
API credentials are rotated monthly. No credential is used from the same IP for more than 30 days.
The pipeline infrastructure itself runs in a separate cloud account with no linking to researcher personal accounts or payment methods.
Research targets are stored with a codename; the real identity is maintained in a separate encrypted store accessible only to the lead researcher.

// Compartmentalized entity storage
struct EntityStore {
    // Public: accessible to all pipeline workers
    // Contains only codenames and derived attributes
    public_db: TimescaleDB,

    // Restricted: accessible only to authorized researchers
    // Maps codenames to real identities
    restricted_db: EncryptedPostgres,  // encrypted at rest, audit-logged
}

// Pipeline workers see only codenames:
// "ENTITY_7a3f" → { asns: [12345], domains: ["example.com"], ... }
// Real name "ACME Corp" is only in restricted_db
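
The access split can be illustrated in memory. This is only a sketch of the idea: the real stores are separate databases, and `Role`, `CompartmentedStore`, and its methods are hypothetical names. Codenames are supplied by the caller here; in practice they should be random rather than derived from the real name, since a deterministic derivation could be reversed.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Role {
    PipelineWorker,
    LeadResearcher,
}

#[derive(Debug, Default)]
pub struct CompartmentedStore {
    public: HashMap<String, Vec<String>>, // codename -> derived attributes
    restricted: HashMap<String, String>,  // codename -> real identity
}

impl CompartmentedStore {
    /// Register a target under a codename; the real name goes only into
    /// the restricted map.
    pub fn register(&mut self, codename: &str, real_name: &str) {
        self.public.entry(codename.to_string()).or_default();
        self.restricted
            .insert(codename.to_string(), real_name.to_string());
    }

    /// Attach a derived attribute to a codename (public side).
    pub fn add_attribute(&mut self, codename: &str, attribute: &str) {
        self.public
            .entry(codename.to_string())
            .or_default()
            .push(attribute.to_string());
    }

    /// Derived attributes, visible to every role.
    pub fn attributes(&self, codename: &str) -> &[String] {
        self.public.get(codename).map(Vec::as_slice).unwrap_or(&[])
    }

    /// Real identity, returned only to the lead researcher.
    pub fn resolve(&self, codename: &str, role: Role) -> Option<&str> {
        match role {
            Role::LeadResearcher => self.restricted.get(codename).map(String::as_str),
            Role::PipelineWorker => None,
        }
    }
}
```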

Integration with Voidly censorship monitoring

The footprint pipeline feeds directly into Voidly's censorship event analysis. When a censorship event is confirmed — a specific platform blocked by a specific ISP in a specific country — we run entity lookups for:

The ISP's ASN (BGP events, route changes at time of block)
The blocking infrastructure's IP ranges (which proxy is intercepting traffic?)
The government ministry responsible for that country's internet regulation (regulatory filings, press releases, official communications)
Known proxy organizations (NGOs, think tanks, or private companies that front for state censorship infrastructure)

This attribution layer is what transforms a raw network measurement ("TCP connection to example.com failed from this ASN at this time") into an incident with context ("ACME Telecom, subsidiary of State Holdings Ltd, began blocking access to example.com using a transparent proxy at IP 1.2.3.4, injecting RST packets 8ms after SYN").

Ethical constraints

The pipeline is restricted to targets that meet three criteria: (1) the target is an organization or a public figure acting in an official capacity, not a private individual; (2) there is documented public interest in the investigation (evidence of censorship, fraud, or human rights violations); and (3) collection methods are passive and confined to publicly accessible data.

We do not sell entity profiles or share them with commercial intelligence firms. Profiles are shared only with credentialed journalists and human rights researchers under data sharing agreements that restrict downstream use. This constraint is in the governance document and enforced operationally by the compartmentalization architecture described above.

For how BGP events from entity-tracked ASNs feed Voidly's internet shutdown detection: BGP routing signals and internet shutdown detection: how Voidly uses IODA data →

For the NLP pipeline that processes social media posts about censorship events, feeding the contextual layer of incident attribution: NLP pipeline for real-time sentiment analysis at scale →

For the OONI historical corpus used as a training dataset for censorship classifiers, which anchors the footprint pipeline's entity attribution: Building the OONI historical corpus: 1.66M downloads, schema normalization →

For how the infrastructure behind censorship events is mapped — DPI vendor signatures, TTL hop-count analysis, and OSINT cross-referencing of procurement contracts: Censorship infrastructure mapping: DPI vendor signatures, block page fingerprints, and OSINT procurement cross-referencing →
