Technical writing

Swarm SDK gossip mesh: bounded fanout routing, message deduplication, and network partition handling

· 10 min read· AI Analytics
Swarm SDKCryptographyDroneInfrastructure

Getting a message from one drone to all 63 others in a swarm sounds like a routing problem. It is — but the constraints eliminate most classical solutions. No coordinator can be trusted to stay online. Topology changes every few seconds as drones move at 30 m/s. Ground control may be out of radio range entirely. The Swarm SDK uses a gossip protocol: each node picks k = 3 random peers, relays, and the epidemic spreads. No routing tables, no coordinator, sub-linear message cost.

Why gossip for drone swarms

Classical message distribution in networked systems assumes stable infrastructure: a broker, a coordinator, or at least a static address to which all nodes subscribe. Drone swarms violate every one of these assumptions simultaneously. The ground control station drifts in and out of radio range as terrain changes and drones ascend or descend. Any single drone in the swarm can go down mid-mission — hit a tree, lose power, or be jammed — and the rest must continue operating without reconfiguration. Network topology changes every few seconds as drones move, reorder, and new links form and break.

A centralized broker requires a reliable path to that broker from every node at all times. Flood broadcast — forwarding every message to every neighbor — is reliable but generates O(N²) traffic; at 64 drones exchanging 10 messages per second, that is 40,960 forwarded frames per second before encryption overhead. Neither approach is viable.

Gossip protocols tolerate all of these constraints naturally. Each node needs only a partial view of the network — a list of peers it has recently heard from. When a node has a message to deliver, it forwards to k random peers from its list. Each recipient does the same. The epidemic spreads through the swarm in ⌈logk(N)⌉ hops, using O(N · k · logk(N)) total transmissions — sub-linear in message count as N grows. There is no coordinator to fail. Topology changes are absorbed silently because the peer list is updated on every received message. A drone going offline simply stops appearing in its neighbors' peer lists within 60 seconds and is no longer selected as a relay target.

The choice of gossip also aligns cleanly with the Double Ratchet session layer above it. The gossip mesh routes encrypted MAVLink v2 TUNNEL frames (message_id 385) and never sees plaintext. The routing layer does not need to understand session state, key epochs, or which ratchet chain a payload belongs to — it sees opaque ciphertext blobs and forwards them. This separation of concerns means the gossip layer can be upgraded, debugged, and benchmarked independently of the cryptographic session layer.

The epidemic broadcast model

When a drone has a message to send, it picks k = 3 neighbors at random from its current peer list and sends them the message. Each recipient that has not already seen this message_id picks k = 3 new neighbors and relays. With k = 3 and typical swarm sizes of 8–64 drones, delivery reaches 95% of nodes in ⌈log3(N)⌉ hops:

Hops needed for N drones with fanout k=3:
N=8:   2 hops  (log₃(8)   ≈ 2.0)
N=27:  3 hops  (log₃(27)  = 3.0)
N=64:  4 hops  (log₃(64)  ≈ 3.8)
N=128: 5 hops  (log₃(128) ≈ 4.4)

The core gossip loop in the Swarm SDK runs on a 250 ms timer with ±50 ms jitter. The jitter prevents synchronized bursts when many drones have messages queued simultaneously — without jitter, all nodes in a swarm initialized from the same mission start would tick in lockstep, creating periodic collision spikes on the radio channel. The jitter spreads load across the 250 ms window.

const FANOUT: usize = 3;
const GOSSIP_INTERVAL_MS: u64 = 250;
const GOSSIP_JITTER_MS: u64 = 50;

pub struct GossipNode {
    node_id: [u8; 16],
    seen_ids: VecDeque<MessageId>,  // sliding window, capacity 1000
    peer_list: Vec<PeerInfo>,
    pending_messages: VecDeque<GossipMessage>,
}

impl GossipNode {
    pub fn receive(&mut self, msg: GossipMessage) -> bool {
        if self.seen_ids.contains(&msg.id) {
            return false;  // already seen, don't relay
        }
        if self.seen_ids.len() >= 1000 {
            self.seen_ids.pop_front();
        }
        self.seen_ids.push_back(msg.id);
        self.pending_messages.push_back(msg);
        true  // new message, will relay
    }

    pub async fn gossip_tick(&mut self, transport: &mut MavlinkTransport) {
        let jitter = rand::thread_rng().gen_range(0..GOSSIP_JITTER_MS);
        tokio::time::sleep(Duration::from_millis(GOSSIP_INTERVAL_MS + jitter)).await;

        let peers = self.select_random_peers(FANOUT);
        for msg in self.pending_messages.drain(..) {
            for peer in &peers {
                transport.send(peer, &msg).await?;
            }
        }
    }
}

The select_random_peers call samples without replacement from the current peer list. If the peer list has fewer than k entries — common during swarm formation when not all nodes have been discovered yet — all known peers are selected. This degrades gracefully to direct broadcast when the peer list is sparse, which is exactly the right behavior for early mission phases before the gossip mesh has fully converged.

At 4 gossip ticks per second with k = 3 peers, each drone sends approximately 12 MAVLink frames per second of gossip traffic under steady-state load. The 900 MHz drone radio links common in long-range swarm operations carry 200–500 kbps easily; 12 encrypted TUNNEL frames at roughly 256 bytes each is about 24 kbps — well within budget. 2.4 GHz links have significantly more headroom.

Message deduplication

The epidemic broadcast model creates duplicates by design — the same message arrives via multiple relay paths. Without deduplication, a node would relay a message every time it received it, turning the epidemic into an unbounded flood. The deduplication mechanism must be fast (checked on every received frame), memory-bounded (constrained embedded hardware), and long-lived enough to cover the full propagation time of any message.

Each gossip message carries a globally unique message_id: [u8; 16] (UUIDv4, generated by the originating node using its hardware RNG). The seen_ids field in GossipNode is a VecDeque operating as a sliding window of the last 1,000 message IDs. Lookup is O(N) over the deque — on STM32H7 this costs 0.8 ms at p50, acceptable given the 250 ms gossip interval.

The window capacity of 1,000 was sized as follows: with typical drone-to-drone origination rates of 4–10 messages per second per active pair in a 16-drone swarm, aggregate origination across the swarm peaks at roughly 30 new messages per second. 1,000 IDs provides a ~33-second deduplication window — long enough that no message travels the network for more than a single hop-cascade before all copies of it have been absorbed. The sliding window evicts the oldest ID (pop_front) when the capacity is reached, ensuring memory usage is strictly bounded at 16 KB (1,000 × 16 bytes) regardless of traffic volume.

A HashSet would give O(1) lookup at the cost of non-deterministic memory allocation patterns — problematic on embedded targets where heap fragmentation is a real concern over multi-hour missions. The VecDeque allocates once at initialization and never reallocates within its capacity. The O(N) scan is dominated by cache performance rather than algorithm complexity for N = 1,000 on a Cortex-M7 with its 32 KB L1 data cache; the entire deque fits in cache.

Peer list and peer discovery

For the gossip protocol to function, each node needs a peer list — a set of known addresses to select relay targets from. The Swarm SDK maintains a peer list of up to 32 nodes per gossip instance. The cap of 32 is a deliberate bound: in swarm sizes of 8–64 nodes, 32 peers is either the full swarm (small missions) or a well-connected local neighborhood (large missions), and 32 entries fits comfortably in a single cache line batch.

On startup, the peer list is bootstrapped from the mission cert bundle (peer_certs) loaded at preflight. This bundle is provisioned by the fleet operator and contains the ML-KEM-768 + X25519 public keys and last-known addresses of all drones assigned to the mission. Bootstrapping from cert bundles means the gossip mesh can begin operating immediately without any discovery handshake — a requirement for time-critical mission start sequences.

As the swarm operates, each gossip message header carries the sender's current IP:port pair. On receipt, the receiving node upserts this address into its peer list:

pub struct GossipMessageHeader {
    pub message_id:  [u8; 16],
    pub sender_id:   [u8; 16],
    pub sender_addr: SockAddr,   // current IP:port of sender
    pub hop_count:   u8,
    pub ttl:         u8,         // max 7 hops
    pub timestamp:   u64,        // Unix milliseconds
}

impl GossipNode {
    fn upsert_peer(&mut self, header: &GossipMessageHeader) {
        let now_ms = unix_time_ms();
        if let Some(entry) = self.peer_list
            .iter_mut()
            .find(|p| p.node_id == header.sender_id)
        {
            // Update address (drone may have changed IP after link re-establishment)
            entry.addr     = header.sender_addr.clone();
            entry.last_seen = now_ms;
        } else if self.peer_list.len() < 32 {
            self.peer_list.push(PeerInfo {
                node_id:   header.sender_id,
                addr:      header.sender_addr.clone(),
                last_seen: now_ms,
            });
        } else {
            // Evict the stalest entry to make room
            if let Some(oldest) = self.peer_list
                .iter_mut()
                .min_by_key(|p| p.last_seen)
            {
                *oldest = PeerInfo {
                    node_id:   header.sender_id,
                    addr:      header.sender_addr.clone(),
                    last_seen: now_ms,
                };
            }
        }
    }

    fn expire_peers(&mut self) {
        let cutoff_ms = unix_time_ms() - 60_000;  // 60-second timeout
        self.peer_list.retain(|p| p.last_seen >= cutoff_ms);
    }
}

The 60-second peer expiry is calibrated against mission dynamics. A drone that has genuinely left radio range will stop sending, and its peer list entry will expire within one minute. A drone that temporarily lost connectivity — brief jamming, passing behind terrain — will typically re-establish within 10–30 seconds and be re-added to the peer list on first receipt. The 60-second window accepts the occasional false positive (a reachable drone expiring early) in exchange for guaranteed cleanup of genuinely offline nodes, which is the correct trade-off for a finite peer-list capacity.

Causal ordering with vector clocks

The gossip layer delivers messages with best-effort ordering — packets arrive in the order that relay paths happen to deliver them, which in a dynamic RF mesh is effectively random. For most drone swarm traffic this is fine. Telemetry, position updates, and sensor readings do not have causal dependencies; the most recent value is always the one that matters.

A subset of session-layer messages cannot tolerate this. Sender Keys group key updates and revocation messages have strict causal ordering requirements: a drone that receives a new group key before it has processed the preceding Sender Key rotation will be unable to decrypt subsequent group messages. The gossip layer supports causal ordering for these messages via Lamport clocks:

pub struct CausalMessage {
    pub gossip_header: GossipMessageHeader,
    pub causal_clock:  u64,                      // monotonically increasing per sender
    pub dependencies:  Vec<(NodeId, u64)>,       // (node_id, clock) pairs this msg depends on
    pub payload:       Vec<u8>,                  // encrypted application payload
}

pub struct CausalLayer {
    // Per-sender clock values we have observed
    observed_clocks: HashMap<NodeId, u64>,
    // Messages waiting for their dependencies to arrive
    pending_queue:   Vec<CausalMessage>,
}

impl CausalLayer {
    pub fn receive(&mut self, msg: CausalMessage) -> Vec<CausalMessage> {
        if self.dependencies_satisfied(&msg) {
            self.advance_clock(&msg);
            // Drain any queued messages whose dependencies are now met
            let mut ready = vec![msg];
            ready.extend(self.drain_pending());
            ready
        } else {
            self.pending_queue.push(msg);
            vec![]
        }
    }

    fn dependencies_satisfied(&self, msg: &CausalMessage) -> bool {
        msg.dependencies.iter().all(|(node_id, clock)| {
            self.observed_clocks
                .get(node_id)
                .map_or(false, |observed| *observed >= *clock)
        })
    }
}

Each node maintains a monotonically increasing causal_clock that advances with every CausalMessage it originates. Dependencies are expressed as a vector of (node_id, clock) pairs: “this message depends on having processed all messages from node X up to and including clock value Y.” A receiving node holds causally dependent messages in a pending queue until all declared dependencies have been observed, then delivers the queued messages in dependency order.

In practice, the pending queue is almost always empty. Key management messages are infrequent (Sender Key rotations happen every 7 days; revocations are rare events), and the gossip mesh delivers messages fast enough — typically within 1–2 seconds for a 32-node swarm — that dependencies arrive before the dependent message in the vast majority of cases. The causal ordering mechanism exists to handle the edge cases correctly, not to be on the critical path. The encrypted payload stream itself uses the Double Ratchet's own sequencing and does not require causal ordering at the gossip layer.

TTL and hop limiting

Each gossip message starts with ttl = 7. Each relay node decrements the TTL before forwarding; messages arriving with ttl = 0 are accepted and processed locally but not forwarded. TTL limiting provides a hard bound on message propagation depth, preventing stale messages from circulating indefinitely in the mesh.

With k = 3 and TTL = 7, a 64-drone swarm reaches all nodes in 4 hops — using only 4 of the 7 allowed TTL decrements. The remaining 3 hops of TTL headroom exist for lossy channels: if a relay path requires an extra hop because the direct path is unavailable, the message can still reach its destination. In environments with high packet loss (electronic warfare conditions, terrain obstruction), this headroom can mean the difference between delivery and silence.

impl GossipNode {
    pub fn should_relay(&self, header: &GossipMessageHeader) -> bool {
        // Drop if TTL exhausted
        if header.ttl == 0 {
            return false;
        }
        // Drop if already seen (dedup check done before this call)
        // Drop if the hop_count is suspiciously large (sanity bound)
        if header.hop_count > 15 {
            return false;
        }
        true
    }

    pub fn relay_message(
        &self,
        mut msg: GossipMessage,
        transport: &mut MavlinkTransport,
        peers: &[PeerInfo],
    ) {
        // Decrement TTL and increment hop count before forwarding
        msg.header.ttl       = msg.header.ttl.saturating_sub(1);
        msg.header.hop_count = msg.header.hop_count.saturating_add(1);
        // Overwrite sender_addr with our own address so next hop can
        // upsert us into their peer list
        msg.header.sender_id   = self.node_id;
        msg.header.sender_addr = self.local_addr.clone();

        for peer in peers {
            transport.send(peer, &msg);
        }
    }
}

Note that the relay node overwrites sender_id and sender_addr with its own identity before forwarding. This is intentional: the gossip header's sender fields track the most recent relay node, not the original message originator. Original message authorship is tracked in the encrypted payload — the Double Ratchet session identifies the true originator cryptographically, immune to spoofing. The gossip header needs only to carry the relay topology for peer-list maintenance purposes.

Anti-entropy reconciliation

The epidemic broadcast model delivers messages with high probability but not certainty. A drone momentarily out of radio range misses all messages sent during the gap. When it re-enters range, the swarm has no mechanism to proactively redeliver what was missed — the originating nodes have already completed their fanout rounds. The Swarm SDK resolves this via an anti-entropy protocol.

Every 5 seconds, each node broadcasts an anti-entropy digest: a compact list of the message IDs it has received in the last 60 seconds. Nodes that receive this digest compare it against their own seen_ids and respond with any messages the requesting node appears to be missing:

pub struct AntiEntropyDigest {
    pub gossip_header: GossipMessageHeader,
    pub message_ids:   Vec<MessageId>,  // up to 200 recent IDs
}

pub struct AntiEntropyResponse {
    pub gossip_header:  GossipMessageHeader,
    pub requested_ids:  Vec<MessageId>,
    pub payloads:       Vec<GossipMessage>,  // filled-in missing messages
}

impl GossipNode {
    pub async fn anti_entropy_tick(
        &self,
        transport: &mut MavlinkTransport,
    ) {
        // Build digest of our last 60 seconds of received message IDs
        let recent_cutoff_ms = unix_time_ms() - 60_000;
        let recent_ids: Vec<MessageId> = self.message_log
            .iter()
            .filter(|entry| entry.received_at_ms >= recent_cutoff_ms)
            .map(|entry| entry.id)
            .take(200)  // cap at 200 IDs per digest broadcast
            .collect();

        let digest = AntiEntropyDigest {
            gossip_header: self.make_header(),
            message_ids:   recent_ids,
        };

        // Broadcast digest to all known peers (not sampled — full list)
        for peer in &self.peer_list {
            transport.send(peer, &digest).await;
        }
    }

    pub async fn handle_anti_entropy_digest(
        &self,
        digest: &AntiEntropyDigest,
        transport: &mut MavlinkTransport,
    ) {
        // Find message IDs in the digest that we have but they apparently don't
        // (i.e., IDs not in their digest but in our log)
        let their_ids: HashSet<MessageId> = digest.message_ids.iter().cloned().collect();
        let missing: Vec<GossipMessage> = self.message_log
            .iter()
            .filter(|entry| !their_ids.contains(&entry.id))
            .filter_map(|entry| self.message_store.get(&entry.id).cloned())
            .collect();

        if missing.is_empty() {
            return;
        }

        let response = AntiEntropyResponse {
            gossip_header:  self.make_header(),
            requested_ids:  missing.iter().map(|m| m.id).collect(),
            payloads:       missing,
        };
        transport.send_to(&digest.gossip_header.sender_addr, &response).await;
    }
}

The 200-ID cap on the digest payload is a MAVLink framing constraint. A single anti-entropy digest must fit in a small number of TUNNEL frames to avoid dominating the radio link. At 16 bytes per message ID, 200 IDs is 3,200 bytes — approximately 25 TUNNEL frames at 128 bytes each — sent once every 5 seconds. This represents a modest background bandwidth of about 5,120 bytes per 5-second window, or roughly 1 kbps of overhead per peer pair, well within budget.

The message_store that backs anti-entropy responses keeps the raw encrypted payloads for 60 seconds after receipt. This is the same 60-second window used by the peer expiry and the anti-entropy digest. Any message older than 60 seconds is considered expired: even if a node missed it, the session layer will have moved on (ratchet advanced, keys rotated) to a point where replaying the old payload would be meaningless. The 60-second window is long enough to cover any realistic partition or outage, and short enough to keep the message_store from growing unboundedly.

Partition handling — split-brain scenario

Drone swarms operating over complex terrain or in contested electromagnetic environments will experience network partitions. A valley between two groups of drones, a sustained jamming burst, or a coordinated electronic warfare event can split the swarm into two or more disconnected subgroups for seconds to minutes. The gossip layer must handle this correctly without corrupting session state.

The partition lifecycle has five phases:

  1. Partition onset. Each subgroup continues operating independently. Gossip broadcast within each partition proceeds normally; messages simply do not reach the other partition. No configuration change is required and no error state is entered. Each subgroup's nodes continue advancing their Double Ratchet sessions for messages they exchange within the partition.
  2. Isolation detection. If a node fails to receive any message from any peer for more than 10 seconds — not just gossip, but any traffic including heartbeats — it enters ISOLATED mode. Encrypted payload transmission continues locally (all data is stored in the ratchet session); the gossip layer buffers outgoing messages up to 1 MB. This buffer covers approximately 5,000 frames at typical payload sizes — enough for several minutes of sensor data in a low-rate telemetry scenario.
  3. Partition merge. When drones from the two partitions re-enter radio range, gossip heartbeats from the other partition appear in each node's receive queue. Nodes immediately exit ISOLATED mode and flush their buffered outgoing messages through the now-restored fanout.
  4. Anti-entropy reconciliation. The next anti-entropy tick (within 5 seconds of re-merging) broadcasts each node's recent message ID digest to the newly reachable peers. Nodes from each partition receive digests from nodes in the other partition and respond with the payloads that were missed during the split. Messages are replayed in causal order using the Lamport clock mechanism described above.
  5. Ratchet reconciliation. The Double Ratchet handles key state divergence using its out-of-order key cache (100-key lookahead per epoch). When replayed messages arrive in the merge phase, the ratchet layer processes them as it would any other out-of-order delivery — pre-deriving and caching keys as needed. Sessions that diverged during the partition converge to a consistent state within the first round-trip of messages exchanged after merge.

This partition handling design intentionally avoids any notion of “split-brain resolution” at the application layer. The gossip mesh makes no attempt to determine which partition's view of state was “correct” during the split — that is the session layer's responsibility. The gossip layer's job is to ensure that after merge, all nodes have seen all messages from all nodes. Given that guarantee, the session layer's causal ordering and Double Ratchet sequencing can reconstruct a consistent view of session state deterministically.

Performance on target hardware

All gossip operations run on the same STM32H7 and Jetson Nano target platforms as the rest of the Swarm SDK. Benchmarks were collected at mission-representative conditions: 16-node swarm, 4–10 messages per second per active pair, with the full cryptographic stack (Double Ratchet encrypt/decrypt) active concurrently.

Gossip layer benchmarks
Platform: STM32H7 (Cortex-M7, 480 MHz) / Jetson Nano (ARM Cortex-A57, 1.43 GHz)

Operation                                 STM32H7 p50   STM32H7 p99   Jetson Nano p50
──────────────────────────────────────────────────────────────────────────────────────
Gossip tick (encode + 3× send)               3.2 ms        8.0 ms          0.4 ms
Message dedup lookup (VecDeque scan)          0.8 ms        2.1 ms          0.05 ms
Anti-entropy digest (200 IDs)                1.1 ms        3.0 ms          0.12 ms
Peer list upsert                             0.2 ms        0.5 ms          0.02 ms
Peer list expiry scan (32 entries)           0.1 ms        0.3 ms          0.01 ms
Causal dependency check (empty queue)        0.05 ms       0.1 ms          0.005 ms

The gossip tick at 3.2 ms p50 is the dominant operation. It includes MAVLink TUNNEL frame serialization, 3× transport sends, and the VecDeque dedup check. With a 250 ms gossip interval, the tick consumes 1.3% of available CPU time on the H7 — negligible on the flight-critical scheduling path. The 8 ms p99 is driven by radio link variance (occasionally one of the 3 sends blocks waiting for a full transmit buffer), not by computation.

The 0.8 ms dedup lookup at p50 reflects a sequential scan of up to 1,000 16-byte entries in the VecDeque. The scan terminates early on a match; in practice most matches occur in the first few hundred entries because recently relayed messages dominate the receive stream. The 2.1 ms p99 corresponds to a full 1,000-entry scan for a genuinely new message ID.

Anti-entropy digest generation at 1.1 ms p50 scans the message_log for entries in the last 60-second window and serializes up to 200 IDs. This operation runs every 5 seconds, contributing 0.022% CPU utilization on the H7. Its p99 of 3.0 ms reflects the worst case of a full 200-entry digest under heavy traffic conditions.

On the Jetson Nano, all gossip operations are faster by 8–40× due to the A57's higher clock speed and out-of-order execution. The Nano runs gossip as a background Tokio task and is never CPU-bound by it. The STM32H7 numbers are the binding constraint for hardware certification; all gossip operations complete well within their scheduling budgets on that platform.

Integration with the cryptographic session layer

The gossip mesh is a transport, not a security boundary. It routes encrypted MAVLink TUNNEL frames without inspecting or modifying their payloads. The session layer — Double Ratchet sessions established via ML-KEM-768 + X25519 hybrid key exchange — sits entirely above the gossip layer and operates on the encrypted ciphertext that the gossip layer treats as opaque bytes.

This separation has several important properties. A node that relays a gossip message learns the message_id (used for dedup), the sender_id and sender_addr (used for peer list maintenance), the hop_count and TTL (used for relay decisions), and the timestamp (used for message_log expiry). It learns nothing about the plaintext content, the session identifiers embedded in encrypted headers, the ratchet epoch, or any key material. Relay nodes from different organizations can coexist in the same mesh without requiring mutual trust beyond routing.

The Sender Keys group encryption mechanism (introduced in Swarm SDK v0.3) uses gossip broadcast specifically for its SPK distribution model. When a drone rotates its Sender Key, the new key is wrapped with each peer's ML-KEM-768 public key and broadcast via gossip. The broadcast reaches all nodes in the mesh within ⌈log3(N)⌉ hops — typically under 2 seconds for a 64-drone swarm. Because Sender Keys provide O(1) group encryption (encrypt once to the group, not once per recipient), the gossip broadcast of SPK updates carries O(N) per-recipient wrappers but only a single encrypted payload per message. The gossip layer is agnostic to this structure; it sees only the TUNNEL frame bytes.

Key revocation messages follow the same path. When a device is revoked — drone captured, operator credential compromised — the fleet operator issues a signed RevocationMessage that is broadcast via gossip. Every node in the mesh receives it within the gossip convergence time, processes it through the causal ordering layer to ensure ordering with respect to any in-flight SPK rotations, and immediately stops accepting messages from the revoked node's session identifiers. The gossip mesh provides the delivery substrate; the key management layer provides the semantics.

For the cryptographic details of the session establishment that the gossip layer supports — the ML-KEM-768 + X25519 hybrid handshake, the Double Ratchet KDF chains and header encryption, and the out-of-order key cache — see the related articles below. The gossip mesh is the delivery layer; everything above it assumes a reliable best-effort channel and builds its own sequencing and ordering on top.


Related technical articles: