Swarm SDK architecture: gossip mesh, post-quantum cryptography, and embedded-first design

8 min read · AI Analytics
Tags: Swarm SDK, Cryptography, Post-quantum, Drone

The Swarm SDK is a protocol stack for encrypted, partition-tolerant communication across autonomous drone meshes. It has no fixed endpoints, no ground-control dependency, and no reliance on PKI infrastructure. It runs in 96 KB of RAM on a Cortex-M7 and provides post-quantum forward secrecy for every message. This post describes the three-layer architecture before the per-component deep-dives that follow in January through March 2026.

Why a new protocol stack

Autonomous drone swarms operate under conditions that rule out most conventional secure-communication tooling. The constraints are not theoretical: they are engineering realities for any fleet that flies beyond visual line of sight, operates in contested electromagnetic environments, or must continue functioning after the ground control station loses radio contact.

MAVLink provides message framing and a rich library of telemetry types, but it was designed for ground-control-to-drone links, not peer-to-peer mesh communication. There is no encryption in the MAVLink v2 specification; authentication is limited to an optional message-signing extension (a truncated SHA-256 tag computed with a shared secret key) that protects message integrity but not confidentiality, and the protocol has no concept of forward secrecy or session management across a mesh of peers.

WireGuard solves the confidentiality problem admirably for IP-routed networks, but it authenticates peers by static public keys that must be configured on both ends in advance, and it assumes an IP-routable path between them. Constrained mesh radios operating at 900 MHz do not route IP packets reliably: they deliver radio frames with significant loss and reordering, and the addressing model changes as drones move. WireGuard's handshake is designed for a two-party tunnel, not for an N-to-N mesh where any drone may need to communicate with any other without prior contact.

The Signal Protocol provides best-in-class session cryptography (X3DH for asynchronous key agreement, Double Ratchet for per-message forward secrecy), but it assumes always-online servers for prekey distribution and message relay. Signal's architecture is server-mediated by design; removing the server removes the core delivery mechanism. In an autonomous swarm with no ground control, there is no server.

The Swarm SDK is designed for the intersection of these constraints: intermittently-connected drone meshes, no ground control dependency, post-quantum key exchange, embedded operation on microcontrollers, and no fixed endpoints that can be blocked or targeted. The result is a three-layer architecture where each layer has a narrow responsibility and can be upgraded independently.

Layer 1 — Gossip Mesh

The gossip mesh is the delivery layer. It routes encrypted MAVLink v2 TUNNEL frames between drones without any fixed topology, leader election, or routing table maintenance. Each node maintains a partial view of the network — a peer list of up to 32 nodes updated on every received message — and uses epidemic broadcast to propagate messages through the swarm.

The broadcast model uses fanout k = 3: a node with a message picks three random peers and sends the message to each. Each recipient that has not already seen this message ID does the same. The epidemic spreads in ⌈log3(N)⌉ hops — four hops for a 64-drone swarm, five for 128 drones. Message deduplication uses a 1,000-entry sliding window of UUIDv4 message IDs stored in a pre-allocated VecDeque. There are no heap allocations in the deduplication path; the entire structure is sized at initialization and never reallocates.
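The deduplication window described above can be sketched as a fixed-capacity deque that evicts its oldest entry before accepting a new one. This is a minimal illustration, not the SDK's implementation: the type name `DedupWindow` and its methods are hypothetical, message IDs are assumed to be stored as full 16-byte UUIDs, and `std::collections::VecDeque` stands in for the `alloc` equivalent an embedded build would use.

```rust
use std::collections::VecDeque;

/// Fixed-capacity sliding window of recently seen message IDs.
/// `DedupWindow` is an illustrative name, not part of the SDK API.
pub struct DedupWindow {
    ids: VecDeque<[u8; 16]>, // 16-byte UUIDv4 values (assumed layout)
    capacity: usize,
}

impl DedupWindow {
    /// Allocates the whole window once, at initialization.
    pub fn new(capacity: usize) -> Self {
        assert!(capacity > 0, "window must hold at least one ID");
        DedupWindow {
            ids: VecDeque::with_capacity(capacity),
            capacity,
        }
    }

    /// Returns true if `id` is new (and records it), false for a duplicate.
    pub fn insert_if_new(&mut self, id: [u8; 16]) -> bool {
        // Scan newest-first so duplicates of recent traffic are found early.
        if self.ids.iter().rev().any(|seen| *seen == id) {
            return false;
        }
        // Evict the oldest entry before pushing, so the deque never exceeds
        // its initial capacity and therefore never reallocates.
        if self.ids.len() == self.capacity {
            self.ids.pop_front();
        }
        self.ids.push_back(id);
        true
    }

    /// Number of IDs currently held in the window.
    pub fn len(&self) -> usize {
        self.ids.len()
    }
}
```

A node would call `insert_if_new` on every received frame and relay only when it returns true. Once an ID has slid out of the window it is treated as new again, which is why the window size has to cover the time a message can stay in flight.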

Key management messages — Sender Key rotations, revocation broadcasts — require causal ordering that the best-effort epidemic model does not guarantee. The gossip layer provides a causal ordering sublayer using Lamport clocks: messages declare their dependencies as (node_id, clock) pairs, and a receiving node holds causally dependent messages in a pending queue until all declared predecessors have been processed. The pending queue is almost always empty in practice, since key management messages are infrequent and the gossip mesh converges within 1–2 seconds for typical swarm sizes.
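The pending-queue behaviour can be illustrated with a small buffer that releases a message only after every declared (node_id, clock) predecessor has been processed. The sketch below is an assumption-laden illustration: `CausalMsg`, `CausalBuffer`, and the field widths are hypothetical, and it uses heap-backed `std` collections for brevity where the embedded build would use pre-allocated storage.

```rust
use std::collections::HashSet;

/// A (node_id, clock) pair identifying one message from one node.
pub type Dep = (u32, u64);

/// Hypothetical message shape: its own ID plus its declared predecessors.
pub struct CausalMsg {
    pub id: Dep,
    pub deps: Vec<Dep>,
}

/// Holds messages until all of their declared predecessors are processed.
#[derive(Default)]
pub struct CausalBuffer {
    processed: HashSet<Dep>,
    pending: Vec<CausalMsg>,
}

impl CausalBuffer {
    /// Accepts a message and returns the IDs delivered as a result, in order.
    pub fn receive(&mut self, msg: CausalMsg) -> Vec<Dep> {
        self.pending.push(msg);
        let mut delivered = Vec::new();
        // Keep releasing messages until a full pass finds none that is ready.
        loop {
            let ready = self
                .pending
                .iter()
                .position(|m| m.deps.iter().all(|d| self.processed.contains(d)));
            match ready {
                Some(i) => {
                    let m = self.pending.remove(i);
                    self.processed.insert(m.id);
                    delivered.push(m.id);
                }
                None => break,
            }
        }
        delivered
    }

    /// Number of messages still waiting on a predecessor.
    pub fn pending_len(&self) -> usize {
        self.pending.len()
    }
}
```

If a key rotation with ID (1, 2) arrives before its predecessor (1, 1), `receive` returns an empty list and holds it; when (1, 1) arrives, both are released in dependency order.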

A TTL of 7 hops limits propagation depth. Each relay node decrements the TTL before forwarding; messages with TTL = 0 are delivered locally but not relayed. With k = 3, a 64-drone swarm needs at least four hops, so TTL = 7 leaves three spare hops (two for a 128-drone swarm) as redundancy for lossy RF channels and multi-hop paths in spread-out formations.
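The hop arithmetic and the relay rule can be checked with two small functions. This is a sketch under the idealized model used here, in which reach multiplies by k on every hop; `min_hops` and `relay_ttl` are hypothetical names.

```rust
/// Smallest h such that k^h >= n, i.e. ceil(log_k(n)) for n >= 1 and k >= 2.
/// Idealized model: every hop multiplies the number of reached nodes by k.
pub fn min_hops(n: u64, k: u64) -> u32 {
    assert!(k >= 2, "fanout must be at least 2");
    let mut reach: u64 = 1;
    let mut hops = 0;
    while reach < n {
        reach = reach.saturating_mul(k);
        hops += 1;
    }
    hops
}

/// Relay rule: a frame received with TTL 0 is delivered locally but not
/// relayed (None); otherwise it is forwarded with the TTL decremented.
pub fn relay_ttl(received_ttl: u8) -> Option<u8> {
    match received_ttl {
        0 => None,
        t => Some(t - 1),
    }
}
```

With these definitions `min_hops(64, 3)` is 4 and `min_hops(128, 3)` is 5, so a TTL of 7 leaves three and two spare hops respectively.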

Network partitions are handled via store-and-forward and anti-entropy reconciliation. A partitioned node buffers outgoing messages in a 1 MB ring buffer. When connectivity is restored, each node broadcasts a compact digest of the last 200 message IDs it has seen in the past 60 seconds. Peers that receive this digest respond with any messages the requesting node appears to be missing. Within five seconds of a partition healing, all nodes in both former sub-meshes have seen all messages that either side produced during the split.
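The digest exchange reduces to a set difference: a peer compares the received digest against its own message store and replies with whatever the digest omits. A minimal sketch, assuming 16-byte message IDs and a hypothetical function name `missing_from_peer`:

```rust
use std::collections::HashSet;

/// 16-byte message ID (assumed layout).
pub type MsgId = [u8; 16];

/// Returns the IDs held locally that the peer's digest does not list,
/// preserving the order of the local store.
pub fn missing_from_peer(local_store: &[MsgId], peer_digest: &[MsgId]) -> Vec<MsgId> {
    let seen: HashSet<MsgId> = peer_digest.iter().copied().collect();
    local_store
        .iter()
        .filter(|id| !seen.contains(*id))
        .copied()
        .collect()
}
```

The responder would then resend the messages for the returned IDs. Because the digest is bounded (200 IDs, 60 seconds), anything older than the window cannot be detected as missing by this exchange.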

Layer 2 — Cryptographic Core

The cryptographic core sits above the gossip mesh and below the application. It is responsible for session establishment, per-message encryption, group key distribution, sender anonymity, and deniable authentication. The gossip layer sees only opaque ciphertext; the cryptographic layer sees only pre-authenticated plaintext. The separation means the routing layer can be upgraded, debugged, and benchmarked in isolation.

X3DH session establishment

The initial session handshake uses Extended Triple Diffie-Hellman (X3DH), adapted from the Signal Protocol with a post-quantum modification. Standard X3DH performs three Diffie-Hellman operations on Curve25519, plus a fourth when a one-time prekey is available. The Swarm SDK replaces the DH operation between the initiator's ephemeral key and the responder's identity key (DH1 in this article's numbering) with ML-KEM-768 encapsulation: the initiator encapsulates against the responder's ML-KEM-768 identity public key, producing a ciphertext and a shared secret, and contributes that shared secret to the key derivation chain alongside the three remaining X25519 operations.

The resulting hybrid KDF chain is:

SK = KDF(
    ML-KEM-768.Decaps(IK_B_kem, ct)   // KEM shared secret
    || DH(EK_A, SPK_B)               // X25519 DH2
    || DH(IK_A, OPK_B)               // X25519 DH3
    || DH(EK_A, OPK_B)               // X25519 DH4
)

Because prekey bundles are distributed via gossip broadcast rather than a server, a drone that wants to establish a session with an offline peer retrieves the bundle from the gossip message store (where it was cached when the target last broadcast it) and initiates the X3DH without any round-trip. The target decrypts the initial message when it next comes online and derives the same session key. The session is now running Double Ratchet; no further X3DH handshakes are required unless the session expires.

Double Ratchet forward secrecy

Once a session is established, all messages use the Double Ratchet algorithm. The ratchet maintains two KDF chains: a sending chain and a receiving chain, each derived via HKDF-SHA-256. Every message advances the chain, deriving a fresh message key and discarding the previous one. Compromise of the current message key reveals nothing about past keys (forward secrecy) and, after the next Diffie-Hellman ratchet step, nothing about future keys (post-compromise security).

The DH ratchet step incorporates a new Curve25519 key exchange on every reply, binding post-compromise healing into the normal message flow. The 100-key lookahead cache handles out-of-order delivery: keys for messages that have not yet arrived are pre-derived and cached, so a message arriving after its successors can still be decrypted without breaking the chain. The cache is bounded at 100 entries per epoch to limit memory use on embedded targets.
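The lookahead cache can be sketched as a receiving chain that pre-derives and stores the keys it skips over. Everything named here is hypothetical (`ReceivingChain`, `key_for`, `KdfStep`), and `toy_kdf` is a deliberately insecure stand-in for HKDF-SHA-256 so that the example runs without external crates.

```rust
use std::collections::HashMap;

pub type Key = [u8; 32];
/// One symmetric-ratchet step: chain key -> (next chain key, message key).
pub type KdfStep = fn(&Key) -> (Key, Key);

/// Toy stand-in for HKDF-SHA-256. NOT cryptographically secure.
pub fn toy_kdf(ck: &Key) -> (Key, Key) {
    let mut next = *ck;
    let mut mk = *ck;
    for (i, b) in next.iter_mut().enumerate() {
        *b = b.wrapping_mul(31).wrapping_add(i as u8 + 1);
    }
    for (i, b) in mk.iter_mut().enumerate() {
        *b = b.wrapping_mul(17).wrapping_add(i as u8 + 7);
    }
    (next, mk)
}

pub struct ReceivingChain {
    chain_key: Key,
    next_index: u32,            // index of the next key to derive
    skipped: HashMap<u32, Key>, // pre-derived keys for messages not yet seen
    max_skipped: usize,         // cache bound (100 in the article)
    kdf: KdfStep,
}

impl ReceivingChain {
    pub fn new(chain_key: Key, max_skipped: usize, kdf: KdfStep) -> Self {
        ReceivingChain {
            chain_key,
            next_index: 0,
            skipped: HashMap::new(),
            max_skipped,
            kdf,
        }
    }

    /// Returns the message key for `index`, or None if it is unavailable.
    pub fn key_for(&mut self, index: u32) -> Option<Key> {
        if index < self.next_index {
            // Late arrival: hand out the cached key once, then forget it.
            return self.skipped.remove(&index);
        }
        // Refuse to skip further ahead than the cache can hold.
        let gap = (index - self.next_index) as usize;
        if self.skipped.len() + gap > self.max_skipped {
            return None;
        }
        // Derive and cache the keys for every message being skipped.
        while self.next_index < index {
            let (next_ck, mk) = (self.kdf)(&self.chain_key);
            self.skipped.insert(self.next_index, mk);
            self.chain_key = next_ck;
            self.next_index += 1;
        }
        // Derive the key for the requested message and advance the chain.
        let (next_ck, mk) = (self.kdf)(&self.chain_key);
        self.chain_key = next_ck;
        self.next_index += 1;
        Some(mk)
    }
}
```

If message 2 arrives first, the keys for messages 0 and 1 are derived and cached, so both can still be decrypted when they arrive later. Each cached key is removed on use, which keeps the cache from retaining keys longer than needed.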

Sealed Sender

Sealed Sender hides the sender's identity from relay nodes. The recipient issues an ephemeral X25519 certificate that any sender can use to encrypt their identity before encrypting the payload. A relay node receives a frame that it can forward but cannot link to a specific sender device without breaking AES-GCM — the relay sees the recipient certificate, not the sender's identity key.

Sender Keys group encryption

Direct Double Ratchet sessions scale as O(N) per sender: encrypting one message to all N peers in a swarm requires N separate encrypt operations. For N = 64 drones this is acceptable but wasteful; for N = 128 it strains the embedded CPU budget. Sender Keys provide O(1) group encryption: a drone generates a single group key (the Sender Key), encrypts the payload once, and broadcasts the ciphertext. The Sender Key itself is distributed via gossip using individual X3DH-encrypted SenderKeyDistributionMessage (SKDM) packets per recipient — a one-time O(N) cost per key rotation, paid rarely.

Group key rotations happen every seven days under normal operation or immediately when a device is revoked. The SKDM broadcast travels via gossip, reaching all nodes in the mesh within the epidemic convergence time.
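The rotation policy amounts to a two-condition check. A minimal sketch, with the hypothetical names `ROTATION_INTERVAL` and `should_rotate`, and assuming the caller tracks key age and pending revocations:

```rust
use std::time::Duration;

/// Seven-day rotation interval under normal operation.
pub const ROTATION_INTERVAL: Duration = Duration::from_secs(7 * 24 * 60 * 60);

/// A Sender Key is rotated when it reaches the interval, or immediately
/// when a device revocation is pending.
pub fn should_rotate(key_age: Duration, revocation_pending: bool) -> bool {
    revocation_pending || key_age >= ROTATION_INTERVAL
}
```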

Deniable HMAC authentication

Each message carries a deniable HMAC tag computed over the ciphertext using a shared symmetric key derived from the session. Unlike asymmetric signatures, HMAC tags cannot be attributed to a specific sender by a third party — any party holding the shared key could have produced the tag. This provides message authenticity within the session while preserving sender deniability to external observers.

Layer 3 — Hardware Abstraction and Transport

The transport layer adapts the gossip mesh's delivery model to the physical constraints of MAVLink v2 radio links. It handles frame fragmentation and reassembly, reliable delivery over lossy RF, and the hardware-specific details of AES acceleration and memory allocation on embedded targets.

MAVLink v2 SWARM_MESH_FRAME

Encrypted gossip payloads are carried in MAVLink v2 TUNNEL messages (message_id 385) with a custom 16-byte SwarmFrame header prepended within the 128-byte TUNNEL payload field. The SwarmFrame header encodes the gossip message_id, hop_count, TTL, and the fragment index and count. Large gossip messages, such as prekey bundles and anti-entropy responses, are split into chunks of up to 112 bytes (the 128-byte field minus the 16-byte header) and reassembled on the receiving side using the fragment index in the SwarmFrame header.

// SwarmFrame header (16 bytes, prepended to TUNNEL payload)
#[repr(C, packed)]
pub struct SwarmFrameHeader {
    pub magic:        [u8; 2],  // 0x53, 0x57 ("SW")
    pub version:      u8,       // protocol version
    pub flags:        u8,       // fragmented | last_fragment | causal
    pub message_id:   [u8; 4],  // first 4 bytes of UUIDv4
    pub hop_count:    u8,       // hops travelled so far
    pub ttl:          u8,       // remaining relay budget
    pub frag_index:   u8,       // 0-based fragment index
    pub frag_total:   u8,       // total fragment count
    pub payload_len:  u16,      // bytes in this fragment
    pub reserved:     [u8; 2],  // assumed padding: the fields above total 14 bytes
}
// Remaining 112 bytes of TUNNEL payload: encrypted gossip payload fragment
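Fragmentation and reassembly keyed on frag_index and frag_total can be sketched as follows. The `Fragment` type and both function names are hypothetical, the chunk size is a parameter and not a constant, and the sketch uses `Vec` for brevity where the embedded build would use its pre-allocated reassembly buffers.

```rust
/// One fragment: the header fields needed for reassembly plus its bytes.
#[derive(Debug, Clone, PartialEq)]
pub struct Fragment {
    pub frag_index: u8,
    pub frag_total: u8,
    pub data: Vec<u8>,
}

/// Splits `payload` into chunks of at most `max_chunk` bytes.
/// Returns None if `max_chunk` is 0 or more than 255 fragments are needed.
pub fn fragment(payload: &[u8], max_chunk: usize) -> Option<Vec<Fragment>> {
    if max_chunk == 0 {
        return None;
    }
    let chunks: Vec<&[u8]> = payload.chunks(max_chunk).collect();
    if chunks.len() > u8::MAX as usize {
        return None; // frag_total is a single byte
    }
    let total = chunks.len() as u8;
    Some(
        chunks
            .iter()
            .enumerate()
            .map(|(i, c)| Fragment {
                frag_index: i as u8,
                frag_total: total,
                data: c.to_vec(),
            })
            .collect(),
    )
}

/// Reassembles fragments received in any order.
/// Returns None if any fragment is missing, duplicated, or inconsistent.
pub fn reassemble(mut frags: Vec<Fragment>) -> Option<Vec<u8>> {
    let total = frags.first()?.frag_total;
    if frags.len() != total as usize {
        return None;
    }
    frags.sort_by_key(|f| f.frag_index);
    // After sorting, indices must be exactly 0..total with matching totals.
    for (i, f) in frags.iter().enumerate() {
        if f.frag_index as usize != i || f.frag_total != total {
            return None;
        }
    }
    Some(frags.into_iter().flat_map(|f| f.data).collect())
}
```

Sorting by index before concatenation lets the receiver tolerate the reordering that the gossip layer can introduce. An incomplete set is rejected, so the caller can keep waiting for the remaining fragments or time out.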

Mesh transport ARQ

The gossip layer provides best-effort delivery; the transport ARQ layer provides reliable delivery for sessions that require it. ARQ uses a sliding window protocol with selective acknowledgement (SACK) and exponential backoff. Round-trip time is tracked using an exponentially weighted moving average (EWMA) with a smoothing factor of 0.125 (the same as TCP's default). The sliding window size is four frames by default, tunable per link based on observed RTT and loss rate.
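The RTT estimator and the backoff schedule can be sketched in a few lines. The smoothing factor 0.125 is the value given above; the names `RttEstimator` and `backoff`, and the choice to pass the base delay and cap as parameters, are assumptions made for illustration.

```rust
use std::time::Duration;

/// Smoothed RTT using an exponentially weighted moving average.
pub struct RttEstimator {
    srtt_ms: Option<f64>,
    alpha: f64,
}

impl RttEstimator {
    pub fn new() -> Self {
        RttEstimator { srtt_ms: None, alpha: 0.125 }
    }

    /// Folds one RTT sample (milliseconds) into the estimate and returns it.
    /// The first sample initializes the estimate directly.
    pub fn observe(&mut self, sample_ms: f64) -> f64 {
        let next = match self.srtt_ms {
            None => sample_ms,
            Some(prev) => (1.0 - self.alpha) * prev + self.alpha * sample_ms,
        };
        self.srtt_ms = Some(next);
        next
    }
}

/// Exponential backoff: base * 2^attempt, never exceeding `max`.
pub fn backoff(base: Duration, attempt: u32, max: Duration) -> Duration {
    // Saturate the multiplier instead of overflowing on large attempt counts.
    let factor = 1u32.checked_shl(attempt).unwrap_or(u32::MAX);
    base.checked_mul(factor).unwrap_or(max).min(max)
}
```

With a prior estimate of 100 ms, a 180 ms sample moves the estimate to 0.875 × 100 + 0.125 × 180 = 110 ms, so a single slow frame shifts the retransmission timing only modestly.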

Embedded HAL

The primary embedded target is the STM32H7 series: Cortex-M7 at 480 MHz, 1 MB RAM, 2 MB flash. The secondary target is the NVIDIA Jetson Nano (ARM Cortex-A57, 4 GB RAM) for mission computers that run a full Linux stack. The swarm-core, swarm-crypto, swarm-mesh, and swarm-mavlink crates all compile with #![no_std]; a no_std feature flag disables any std-dependent functionality.

Memory management on embedded targets uses a static heap of 96 KB allocated via the alloc-cortex-m crate (since renamed embedded-alloc). All data structures that require dynamic allocation (the deduplication VecDeque, the out-of-order key cache, the fragment reassembly buffer) are pre-allocated at initialization and never grow beyond their initial capacity. There are no heap allocations in the hot path (message receive, deduplication check, decrypt, deliver).

AES-GCM encryption on the STM32H7 uses the hardware AES accelerator via the stm32h7xx-hal crate. The hardware path runs at 0.14 ms per 200-byte payload; the software fallback (for targets without a hardware AES unit) runs at 0.61 ms. The hardware accelerator is used for all AES-GCM operations in the Double Ratchet and Sender Keys paths; the ML-KEM-768 operations (which use SHA-3 internally) run in software on the Cortex-M7.

Post-quantum threat model

The harvest-now-decrypt-later attack is the primary post-quantum threat for drone communications. An adversary that records encrypted swarm traffic today and stores it for future decryption will be able to retroactively decrypt all traffic if they later develop a cryptanalytically relevant quantum computer — unless that traffic was protected by post-quantum key exchange from the start.

NIST finalized ML-KEM (derived from CRYSTALS-Kyber) as FIPS 203 in August 2024; ML-KEM-768 is its middle parameter set. The Swarm SDK uses ML-KEM-768 in a hybrid construction with X25519: both algorithms must be broken to recover the session key. An adversary who breaks X25519 using a quantum computer cannot retroactively decrypt traffic because the ML-KEM-768 component provides quantum-resistant protection. Conversely, if ML-KEM-768 falls to a classical cryptanalysis advance, the session key is still protected by X25519. The hybrid construction provides defense in depth against advances on either algorithm class.

The Double Ratchet provides per-message forward secrecy on top of the post-quantum key exchange. Even if a long-term identity key is compromised — the device is captured, the key material is extracted — the compromise reveals only the current session state. Past messages, protected by keys that have already been ratcheted forward and discarded, remain secure. Future messages, protected by keys that will be derived from new DH ratchet steps, become secure again within one round-trip of the compromise being detected and the device revoked.

The Swarm SDK targets CNSA 2.0 compliance (the NSA's post-quantum requirements for national security systems) by 2027. CNSA 2.0 specifies ML-KEM-1024 for key establishment and ML-DSA-87 for signatures; the current v0.3 release implements ML-KEM-768 for key establishment and uses a classical HMAC for message authentication, with ML-DSA integration scheduled for v0.5.

Embedded constraints and binary size

Embedded operation on the STM32H7 imposes constraints that shaped every design decision in the SDK. The 1 MB RAM envelope means the entire cryptographic state — all active sessions, the deduplication window, the fragment reassembly buffers, the ARQ sliding window, the out-of-order key cache — must coexist with the flight controller's own state. The 2 MB flash envelope means the SDK binary plus the application must fit comfortably.

Binary size breakdown (STM32H7, opt-level="z" + LTO):
  swarm-core     (types, serialization, crypto)   112 KB
  swarm-crypto   (X3DH, Double Ratchet, SK, SS)    98 KB
  swarm-mesh     (gossip, ARQ, anti-entropy)        42 KB
  swarm-mavlink  (MAVLink v2 framing)               32 KB
  ─────────────────────────────────────────────────────────
  Total SDK binary                                 284 KB

Static heap allocation at runtime:
  Dedup VecDeque (1,000 × 16 bytes)               16 KB
  Active sessions (max 16, ~2 KB each)             32 KB
  Fragment reassembly buffers (4 × 1,024 bytes)     4 KB
  ARQ sliding window (4 frames × 256 bytes)          1 KB
  Out-of-order key cache (100 entries × ~80 bytes)   8 KB
  Gossip peer list (32 × 64 bytes)                   2 KB
  Message store (anti-entropy, 200 × 256 bytes)     50 KB
  ─────────────────────────────────────────────────────────
  Total static heap                                113 KB  (16 sessions; exceeds the 96 KB allocation)

The production configuration for STM32H7 limits concurrent sessions to 8 (rather than 16), which halves the session table from 32 KB to 16 KB and brings the tabulated total to about 97 KB against the 96 KB allocation. The per-item figures above are rounded, so the exact margin depends on the real per-session size. Missions with more than 8 active session peers use Sender Keys for the bulk payload path and reserve individual Double Ratchet sessions for point-to-point communications with mission-critical nodes (ground control handoff, fleet operator links).

The no_std build eliminates the standard library allocator, the Rust runtime, and all OS-dependent functionality. Panic handling uses a bare-metal handler that triggers a watchdog reset rather than unwinding. The build is validated against thumbv7em-none-eabihf (Cortex-M7 with hardware FPU) and confirmed to produce no implicit heap allocations using the no-std-check CI step.

Crate structure

The SDK is organized as a Cargo workspace with six member crates. The split between no_std crates (embedded deployable) and std crates (ground station / CI tooling) is enforced at the crate boundary rather than by feature flags inside a monolithic crate, making it impossible to accidentally pull std dependencies into the embedded binary.

# Cargo.toml workspace layout
[workspace]
members = [
    "swarm-core",    # no_std: types, serialization, crypto primitives
    "swarm-crypto",  # no_std: X3DH, Double Ratchet, Sealed Sender, Sender Keys
    "swarm-mesh",    # no_std: gossip, mesh transport ARQ, anti-entropy
    "swarm-mavlink", # no_std: MAVLink v2 framing, TUNNEL messages
    "swarm-std",     # std: fleet CA, cert management, key provisioning, logging
    "swarm-sim",     # std: simulation harness for testing without hardware
]

swarm-core defines the shared types — frame headers, peer descriptors, session identifiers, error types — and the serialization logic that all other crates depend on. It has no cryptographic dependencies; it exists to prevent circular imports between swarm-crypto and swarm-mesh.

swarm-crypto implements the full cryptographic stack: X3DH key bundle generation and session initiation, Double Ratchet encrypt and decrypt, Sealed Sender wrapping and unwrapping, Sender Key generation and SKDM encoding, and the deniable HMAC layer. All ML-KEM-768 operations are provided by the ml-kem crate (a no_std-compatible implementation of FIPS 203).

swarm-mesh implements the gossip protocol, anti-entropy reconciliation, and the ARQ transport. It depends on swarm-core for frame types and on swarm-crypto for message authentication, but treats encrypted payloads as opaque byte slices.

swarm-mavlink handles MAVLink v2 framing: TUNNEL message construction, SwarmFrame header encoding/decoding, and fragmentation/reassembly. It is the only crate with a dependency on the MAVLink message definitions.

swarm-std runs only on the ground station or CI host. It provides the fleet certificate authority (signing drone identity keys), the certificate rotation workflow, the key provisioning tool that produces mission cert bundles, and the structured logging backend. Because it is std-only, it can use the full Rust ecosystem without embedded size constraints.

swarm-sim provides a simulation harness that runs a configurable number of virtual drone nodes in a single process, using in-memory channels to simulate radio links with configurable loss rates and latency distributions. All integration tests — including the partition healing and anti-entropy reconciliation tests — run against swarm-sim before any hardware-in-the-loop testing.

Performance benchmarks on STM32H7

All operations in the critical path were benchmarked on the STM32H7 at 480 MHz with the hardware AES accelerator active. The benchmarks represent a mission-representative load: 16 active drone sessions, gossip traffic at 4 ticks per second, and concurrent Sender Key group traffic at 10 messages per second.

Operation                          Time (p50)
────────────────────────────────────────────────────
X3DH session init (PQ hybrid)       62 ms
Double Ratchet encrypt (200 B)       1.8 ms
Sender Key encrypt (200 B)           0.7 ms
AES-GCM (hardware accelerator)       0.14 ms
AES-GCM (software fallback)          0.61 ms
Gossip tick (k=3 fanout)             3.2 ms
Deduplication check (VecDeque)       0.8 ms
Fragment (200 B → 2 frames)          0.09 ms
Mesh transport ARQ cycle             0.4 ms

The X3DH session initialization at 62 ms is the most expensive operation but is infrequent: sessions are established once and persist for the duration of a mission (typically hours). The 62 ms cost covers the initiator side, namely the ML-KEM-768 encapsulation (decapsulation is performed by the recipient on a separate tick) and three X25519 scalar multiplications. Both kinds of operation run in software on the Cortex-M7; the STM32H7 does not have hardware acceleration for ML-KEM or elliptic curve operations.

The Double Ratchet encrypt path at 1.8 ms dominates steady-state CPU usage for point-to-point sessions. It includes HKDF-SHA-256 key derivation, AES-GCM encryption (hardware accelerated), and the HMAC authentication tag. The Sender Key path at 0.7 ms is faster because it skips the HKDF key derivation step — the group key is already cached — and goes directly to AES-GCM.

The gossip tick at 3.2 ms consumes 1.3% of available CPU time at a 250 ms gossip interval (3.2 / 250 ≈ 0.013). The deduplication check at 0.8 ms is the worst case, a full scan of all 1,000 VecDeque entries; typical checks finish sooner because the most recently relayed message IDs appear near the head of the deque. Together, gossip overhead occupies well under 5% of the flight controller budget, leaving ample headroom for the PX4 flight stack running concurrently.

Security properties

The Swarm SDK makes the following formal security claims:

  1. Forward secrecy. Compromise of the current session key does not decrypt past messages. The Double Ratchet discards each message key after use; past keys cannot be derived from the current chain state.
  2. Post-compromise security. The Double Ratchet heals automatically after a key compromise. After the next DH ratchet step (the first reply in the session following the compromise), future message keys are derived from fresh Diffie-Hellman material that the attacker does not possess.
  3. Sender anonymity. A relay node cannot link a Sealed Sender message to the originating device without breaking AES-GCM on the outer encryption layer. The relay sees the recipient's ephemeral certificate, not the sender's identity.
  4. Deniability. HMAC authentication tags cannot be attributed to a specific sender by a third party. Because the tag is computed from a shared symmetric key (rather than a private signing key), any party holding the shared session key could have produced it.
  5. Partition tolerance. Messages produced during a network partition are buffered and redelivered after partition healing via anti-entropy reconciliation. No message is permanently lost due to a temporary network split, provided the partition duration is under 60 seconds (the message store retention window).

The post-quantum claim specifically addresses the harvest-now-decrypt-later attack: the hybrid ML-KEM-768 + X25519 key exchange means an attacker who stores today's traffic and later develops a quantum computer that breaks X25519 cannot decrypt it, because the ML-KEM-768 component resists quantum attack. Both components must be broken simultaneously to compromise the session key.

Article series roadmap

This overview introduces the three-layer architecture. The articles that follow each cover a single component in depth:

  • Gossip mesh routing — epidemic broadcast fanout, UUIDv4 deduplication ring, Lamport clock causal ordering, TTL hop limiting, anti-entropy reconciliation, and partition-tolerant store-and-forward. (January 2026)
  • Mesh transport ARQ — sliding window reliable delivery, SACK, EWMA RTT estimation, and backpressure handling over lossy RF. (January 2026)
  • X3DH session establishment — prekey bundle construction, ML-KEM-768 encapsulation replacing DH1, one-time prekey consumption, and the transition to Double Ratchet. (January 2026)
  • Double Ratchet forward secrecy — KDF chain mechanics, DH ratchet step, HKDF-SHA-256 derivation, 100-key lookahead cache, and out-of-order delivery. (February 2026)
  • Sealed Sender — recipient-issued ephemeral certificate construction, sender identity encryption, and relay opacity guarantees. (February 2026)
  • Sender Keys and v0.3 — O(1) group encryption, SKDM gossip distribution, revocation-triggered key rotation, and deniable HMAC authentication. (February 2026)
  • MAVLink integration — TUNNEL message framing, SwarmFrame header layout, and the MAVLink-to-gossip bridge. (February 2026)
  • Key management — device provisioning, fleet CA, certificate rotation, and revocation for autonomous drone systems. (March 2026)
  • Device enrollment — mission cert bundle provisioning, prekey bundle broadcast, and the drone startup sequence. (March 2026)
  • Post-quantum design rationale — FIPS 203 finalization, CNSA 2.0 requirements, hybrid construction security arguments, and the migration path from classical-only to PQ-hybrid. (March 2026)
  • v0.4 features — situational awareness extensions, electronic warfare coordination, and adversarial resilience improvements. (March 2026)
