The Swarm SDK double ratchet: forward secrecy and post-compromise security in drone mesh networks

9 min read · AI Analytics
Swarm SDK · Cryptography · Post-quantum · Drone

A session key compromise in a classical TLS-style protocol is catastrophic: decrypt the handshake, derive the session key, and every byte of that session is plaintext. Drone communications demand better. The Swarm SDK implements the Double Ratchet algorithm — the same cryptographic core used by Signal — adapted for a post-quantum threat model, MAVLink v2 framing, and the harsh realities of embedded ARM hardware running at altitude.

Why forward secrecy matters for drone communications

Classical session security uses a single session key derived at handshake time. Every message in a session is encrypted and decrypted with keys derived from that root. An adversary who compromises the session key — by capturing a drone, extracting memory, or breaking the key exchange — can decrypt every past and future message in that session. One compromise, total exposure.

Forward secrecy changes the calculus. With per-message keys derived from a ratcheting key derivation function, compromising the key state at message N does not expose messages 0 through N−1. Those keys were derived, used, and immediately deleted. They cannot be reconstructed from present state. The “blast radius” of a compromise is bounded to messages from the compromised point forward, and even that window closes as the ratchet advances.

Post-compromise security goes further. After an adversary obtains current key state (say, by recovering a downed drone's secure SRAM before it zeroes), forward secrecy alone does not help with future messages. The adversary holds the chain key and can derive everything that follows. The Double Ratchet's DH ratchet component solves this: once the remaining nodes exchange new Diffie-Hellman ephemeral keys, whose private halves were generated after the compromise and are unknown to the adversary, the session “heals”. The adversary's captured key material cannot decrypt messages sent after that ratchet step.

The drone threat model makes both properties essential. Captured hardware is a routine operational risk. A downed drone may be recovered, physically inspected, and its volatile memory read before the zeroization watchdog fires. Forward secrecy limits the damage to the active window; post-compromise security ensures that once the drone is neutralized, the rest of the mesh continues operating securely. No re-keying ceremony at a secure facility is required — the ratchet handles it automatically as messages flow.

The Double Ratchet: two interlocked mechanisms

The Double Ratchet algorithm was specified by Marlinspike and Perrin in 2016 and forms the cryptographic core of the Signal Protocol. It combines two distinct ratchet mechanisms that operate at different timescales.

The symmetric-key ratchet (KDF chain) advances with every message. It takes the current chain key, feeds it through a key derivation function, and produces two outputs: a new chain key (passed to the next step) and a message key (used to encrypt or decrypt this message and then immediately deleted). The chain key advances monotonically — old keys are gone, and the one-way nature of the KDF makes them unrecoverable from the new chain key. This delivers forward secrecy at the per-message granularity.

The Diffie-Hellman ratchet (asymmetric ratchet) advances at a coarser timescale: one step per exchanged message with a new ephemeral DH key pair. Each ratchet step generates a new DH shared secret that is mixed into the root chain, which then re-seeds the sending and receiving KDF chains. This is the “healing” mechanism. Once a new DH ratchet step completes, an adversary holding the old chain state cannot derive the new chain keys. Post-compromise security is restored.

The two ratchets interlock: the DH ratchet provides the root key material that reseeds the KDF chains, and the KDF chains provide per-message forward secrecy between DH ratchet steps. Together they give a protocol where every message has a unique key and the session self-heals after compromise.

Adapting the DH ratchet for ML-KEM-768

The classical Double Ratchet uses X25519 for the DH ratchet. Both parties generate ephemeral X25519 key pairs, exchange public keys in message headers, and compute a shared secret via Diffie-Hellman: dh_output = X25519(my_sk, their_pk). The operation is symmetric — either side can compute the same output from their own secret and the other's public key.

ML-KEM-768 does not work this way. It is a Key Encapsulation Mechanism, not a key agreement protocol. The operation is asymmetric: the sender calls encapsulate(recipient_pk) and receives a shared secret plus a ciphertext. The recipient calls decapsulate(ciphertext, recipient_sk) to recover the same shared secret. Only the holder of the recipient's private key can decapsulate; an observer who sees only the ciphertext and the public key cannot derive the shared secret.

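This asymmetry reshapes the ratchet. Instead of a symmetric DH exchange, a ratchet built purely on ML-KEM-768 has to be an encapsulation ratchet, in which each step is a one-way encapsulation to the peer's latest public key. The flow for a single ratchet step: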

  1. Alice generates a new ML-KEM-768 key pair: (alice_pk_new, alice_sk_new).
  2. Alice encapsulates to Bob's current public key: (kem_ct, kem_ss) = ml_kem_768.encapsulate(bob_pk_current). She includes kem_ct and alice_pk_new in the encrypted message header.
  3. Bob receives the message, decapsulates: kem_ss = ml_kem_768.decapsulate(kem_ct, bob_sk_current). He now holds the same kem_ss as Alice.
  4. Bob derives the new root key from kem_ss and advances his sending and receiving chains.
  5. On Bob's next message, he encapsulates to alice_pk_new, providing Alice's ratchet step in turn.

The ratchet remains bidirectional and self-healing. The key difference from classical Double Ratchet is that each ratchet step is initiated by the sender, not negotiated symmetrically. The recipient can only advance their ratchet once they receive the new encapsulation ciphertext. This is an inherent property of KEMs, not a protocol weakness — the healing still happens, it just requires a message to flow.
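The five steps above can be traced in a short Python sketch. It is a message-flow illustration only: `ToyKem` is a deliberately insecure stand-in with a KEM-shaped API (no ML-KEM-768 binding is assumed to be available), and `mix_root`, the `hkdf_sha256` helper, and the info string are hypothetical names, not the SDK's.

```python
import hashlib
import hmac
import os

def hkdf_sha256(ikm: bytes, salt: bytes, info: bytes, length: int) -> bytes:
    """HKDF (RFC 5869) with SHA-256: extract, then expand."""
    prk = hmac.new(salt or b"\x00" * 32, ikm, hashlib.sha256).digest()
    okm, block, counter = b"", b"", 1
    while len(okm) < length:
        block = hmac.new(prk, block + info + bytes([counter]), hashlib.sha256).digest()
        okm += block
        counter += 1
    return okm[:length]

class ToyKem:
    """INSECURE stand-in: anyone holding pk and ct can compute the secret.
    It only mimics the keygen / encapsulate / decapsulate call shape."""

    @staticmethod
    def keygen() -> tuple[bytes, bytes]:
        sk = os.urandom(32)
        return hashlib.sha256(b"pk" + sk).digest(), sk      # (pk, sk)

    @staticmethod
    def encapsulate(pk: bytes) -> tuple[bytes, bytes]:
        ct = os.urandom(32)
        return ct, hashlib.sha256(pk + ct).digest()          # (ciphertext, secret)

    @staticmethod
    def decapsulate(ct: bytes, sk: bytes) -> bytes:
        pk = hashlib.sha256(b"pk" + sk).digest()
        return hashlib.sha256(pk + ct).digest()

def mix_root(root_key: bytes, kem_ss: bytes) -> bytes:
    """Fold one step's KEM secret into the root key (hypothetical info string)."""
    return hkdf_sha256(kem_ss, root_key, b"toy-kem-ratchet-root", 32)

# Starting state: both sides share a root, and Alice knows Bob's public key.
alice_root = bob_root = os.urandom(32)
bob_pk, bob_sk = ToyKem.keygen()

# Steps 1-2: Alice makes a fresh key pair and encapsulates to Bob's current key.
alice_pk_new, alice_sk_new = ToyKem.keygen()
kem_ct, kem_ss = ToyKem.encapsulate(bob_pk)
alice_root = mix_root(alice_root, kem_ss)
header = (kem_ct, alice_pk_new)          # travels in the message header

# Steps 3-4: Bob decapsulates and derives the same new root.
bob_root = mix_root(bob_root, ToyKem.decapsulate(header[0], bob_sk))
roots_match_after_first_step = alice_root == bob_root

# Step 5: Bob's reply encapsulates to alice_pk_new, ratcheting the other way.
reply_ct, reply_ss = ToyKem.encapsulate(header[1])
bob_root = mix_root(bob_root, reply_ss)
alice_root = mix_root(alice_root, ToyKem.decapsulate(reply_ct, alice_sk_new))
```

Each side can only advance after the ciphertext arrives, which is the "requires a message to flow" property described above.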

The hybrid approach: ML-KEM-768 roots, X25519 ratchet steps

A fully ML-KEM-768 ratchet has a fundamental problem for MAVLink v2: ML-KEM-768 public keys are 1,184 bytes and ciphertexts are 1,088 bytes. MAVLink v2's TUNNEL message (message_id 385) carries a payload field of at most 128 bytes, alongside a 2-byte payload type. Embedding a KEM ciphertext and a fresh public key in every ratchet step would take 2,272 bytes, roughly 18 full TUNNEL payloads.

The Swarm SDK resolves this with a layered hybrid strategy. ML-KEM-768 is used exclusively for the initial session establishment (described in the post-quantum drone mesh architecture article). The resulting session root key — derived from both ML-KEM-768 and X25519 shared secrets — is post-quantum secure by construction. Subsequent DH ratchet steps use X25519, whose 32-byte public keys and 32-byte outputs fit comfortably in a message header.

The threat model here is carefully constructed. An adversary with a sufficiently powerful quantum computer can break X25519. What they cannot do is break the ML-KEM-768 component of the initial session key. The root chain that seeds all subsequent KDF derivations was initialized with post-quantum key material. When the X25519 DH ratchet advances, that step's output is mixed into the root chain alongside the existing post-quantum root. The mixing function is:

# Root chain advance on DH ratchet step (X25519 ratchet step)
# root_key is 32 bytes, initialized from ML-KEM-768 + X25519 hybrid at session start
# dh_output is the 32-byte output of X25519(my_ratchet_sk, their_ratchet_pk)

root_key_new, chain_seed = HKDF-SHA-256(
    ikm   = dh_output,
    salt  = root_key,          # PQ-derived root as HKDF salt — mixes PQ material in
    info  = b"swarm-ratchet-root-v1",
    len   = 64                 # 32 bytes root_key_new + 32 bytes chain_seed
)

# Derive new sending chain key from chain_seed
sending_chain_key = HKDF-SHA-256(
    ikm   = chain_seed,
    salt  = b"",
    info  = b"swarm-ratchet-send-v1",
    len   = 32
)

An adversary who can break X25519 recovers dh_output for that ratchet step. But the HKDF salt is the PQ-derived root key, which the adversary does not know, so the output of the HKDF is still post-quantum secure. The root chain carries the PQ entropy forward through every subsequent X25519 ratchet step. The X25519 ratchet provides post-compromise recovery against classical adversaries; the ML-KEM-768 root provides post-quantum security throughout the session lifetime.
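The root-chain advance above can be run as a minimal Python sketch using only the standard library. The `hkdf_sha256` helper is an illustrative RFC 5869 implementation and `advance_root` is a hypothetical name, not the SDK's API; the info string is the one from the listing, and the keys are random stand-ins.

```python
import hashlib
import hmac
import os

def hkdf_sha256(ikm: bytes, salt: bytes, info: bytes, length: int) -> bytes:
    """HKDF (RFC 5869) with SHA-256: extract, then expand."""
    prk = hmac.new(salt or b"\x00" * 32, ikm, hashlib.sha256).digest()
    okm, block, counter = b"", b"", 1
    while len(okm) < length:
        block = hmac.new(prk, block + info + bytes([counter]), hashlib.sha256).digest()
        okm += block
        counter += 1
    return okm[:length]

def advance_root(root_key: bytes, dh_output: bytes) -> tuple[bytes, bytes]:
    """Mix one ratchet step's DH output into the root chain.

    The current root key is the HKDF salt; the DH output is the input key
    material. Returns (root_key_new, chain_seed), 32 bytes each.
    """
    okm = hkdf_sha256(dh_output, root_key, b"swarm-ratchet-root-v1", 64)
    return okm[:32], okm[32:]

# Random stand-ins for the PQ-derived root and for one X25519 output.
root_key = os.urandom(32)
dh_output = os.urandom(32)
root_key_new, chain_seed = advance_root(root_key, dh_output)

# An attacker who learns dh_output but has to guess the root gets unrelated output.
guess_root, guess_seed = advance_root(os.urandom(32), dh_output)
```

Using the root key as the salt means both inputs are needed to reproduce the output, which is the property the paragraph above relies on.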

KDF chains: root, sending, and receiving

The Double Ratchet maintains three KDF chains. Each chain is a 32-byte state that advances with each operation. All derivations use HKDF-SHA-256.

  • Root chain (RC): Advanced on each DH ratchet step. The RC is the long-lived state that carries PQ entropy and provides post-compromise healing. It is never directly used to encrypt messages.
  • Sending chain (SC): Advanced on each message sent. Seeded from the RC at the start of each DH epoch. Each advance produces the next message key for the outgoing message.
  • Receiving chain (RcC): Advanced on each message received. Seeded from the RC when a new DH ratchet step is completed by the remote peer. Each advance produces the expected message key for the incoming message.

The KDF ratchet step for both SC and RcC is identical:

# KDF chain step — advances the sending or receiving chain by one message
# chain_key: current 32-byte chain state
# Returns: (new_chain_key, message_key), each 32 bytes

def kdf_chain_step(chain_key: bytes) -> (bytes, bytes):
    # Two HKDF expansions from the same chain key, differentiated by info string
    message_key = HKDF-SHA-256(
        ikm   = chain_key,
        salt  = b"",                   # constant salt for message key
        info  = b"swarm-msg-key-v1",
        len   = 32,
    )
    new_chain_key = HKDF-SHA-256(
        ikm   = chain_key,
        salt  = b"",                   # constant salt for chain advance
        info  = b"swarm-chain-key-v1",
        len   = 32,
    )
    # message_key is passed to AES-256-GCM; deleted from memory after use
    # new_chain_key replaces chain_key in session state
    return (new_chain_key, message_key)

# AES-256-GCM encryption with the derived message key
def encrypt_message(message_key: bytes, nonce: bytes, plaintext: bytes) -> bytes:
    ciphertext = AES-256-GCM.encrypt(
        key   = message_key,
        nonce = nonce,           # 12-byte random nonce, unique per message
        aad   = session_id || ratchet_epoch || message_index,
        msg   = plaintext,
    )
    zeroize(message_key)         # immediately wiped from memory
    return ciphertext            # includes 16-byte GCM authentication tag

The additional authenticated data (AAD) bound to the AES-256-GCM encryption includes the session identifier, the current DH ratchet epoch, and the per-chain message index. This prevents cross-session ciphertext replay and ensures that a message encrypted in epoch 4 cannot be accepted as a valid epoch 7 message, even if the adversary has managed to desynchronize the ratchet state.
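A runnable Python sketch of the chain step and the AAD binding follows. The `hkdf_sha256` helper and `build_aad` are illustrative, and the AAD byte encoding (16-byte session ID followed by two big-endian uint32 counters) is an assumption; the article does not specify it. AES-256-GCM itself is left out because it is not in the Python standard library.

```python
import hashlib
import hmac
import struct

def hkdf_sha256(ikm: bytes, salt: bytes, info: bytes, length: int) -> bytes:
    """HKDF (RFC 5869) with SHA-256: extract, then expand."""
    prk = hmac.new(salt or b"\x00" * 32, ikm, hashlib.sha256).digest()
    okm, block, counter = b"", b"", 1
    while len(okm) < length:
        block = hmac.new(prk, block + info + bytes([counter]), hashlib.sha256).digest()
        okm += block
        counter += 1
    return okm[:length]

def kdf_chain_step(chain_key: bytes) -> tuple[bytes, bytes]:
    """Advance a sending or receiving chain by one message."""
    message_key = hkdf_sha256(chain_key, b"", b"swarm-msg-key-v1", 32)
    new_chain_key = hkdf_sha256(chain_key, b"", b"swarm-chain-key-v1", 32)
    return new_chain_key, message_key

def build_aad(session_id: bytes, ratchet_epoch: int, message_index: int) -> bytes:
    """Assumed encoding: fixed-width fields, so the concatenation is unambiguous."""
    if len(session_id) != 16:
        raise ValueError("session_id must be 16 bytes")
    return session_id + struct.pack(">II", ratchet_epoch, message_index)

# Advance a chain three times; every step yields a fresh message key.
chain_key = bytes(32)            # all-zero placeholder, for illustration only
message_keys = []
for _ in range(3):
    chain_key, message_key = kdf_chain_step(chain_key)
    message_keys.append(message_key)

# The same message index in two different epochs produces different AAD,
# so a ciphertext authenticated under one cannot verify under the other.
aad_epoch4 = build_aad(bytes(16), 4, 0)
aad_epoch7 = build_aad(bytes(16), 7, 0)
```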

Header encryption: hiding ratchet state from observers

Every Double Ratchet message must carry metadata so the recipient can advance their ratchet to the correct state: the sender's current ratchet public key (so the recipient can complete the DH step), the ratchet epoch number, and the message index within the current epoch. In the basic Double Ratchet specification, this header information is transmitted in plaintext; header encryption is defined there only as an optional variant.

Plaintext headers are a traffic analysis vector. An observer who cannot decrypt the message content can still watch ratchet epoch numbers advance, count messages per epoch, and correlate header contents across flows. In a drone mesh, this leaks communication topology — which pairs are exchanging messages, at what rate, and when ratchet steps occur. The Swarm SDK encrypts message headers using a dedicated header key derived from the root KDF at session initialization.

# Header key derivation at session start (done once, stored in session state)
header_key_send, header_key_recv = HKDF-SHA-256(
    ikm   = root_key_initial,
    salt  = b"",
    info  = b"swarm-header-keys-v1",
    len   = 64,                            # 32 bytes send + 32 bytes recv
)

# Plaintext header fields (before encryption)
struct MessageHeader:
    ratchet_pk:     bytes[32]              # sender's current X25519 ratchet public key
    ratchet_epoch:  uint32                 # DH ratchet step counter
    message_index:  uint32                 # per-epoch message counter

# On-wire message layout
struct SwarmMessage:
    encrypted_header:  bytes[48]           # AES-256-GCM(header_key_send, header_nonce, MessageHeader); the fields above total 40 bytes, so 48 implies 8 bytes of padding or fields not listed here
    header_nonce:      bytes[12]           # random 12-byte nonce for header GCM
    header_tag:        bytes[16]           # GCM authentication tag for header
    body_nonce:        bytes[12]           # random 12-byte nonce for body GCM
    ciphertext:        bytes[N]            # AES-256-GCM(message_key, body_nonce, plaintext)
    body_tag:          bytes[16]           # GCM authentication tag for body

The header key is static for the session lifetime (it does not ratchet). This is a deliberate trade-off: ratcheting the header key would require the recipient to maintain a separate header key chain and attempt decryption with multiple candidate keys for out-of-order messages. The static header key is sufficient because it is derived from the PQ root key material and is never transmitted on the wire — an adversary without the initial session establishment cannot recover it.

The total overhead per message is 76 bytes of header material (48 ciphertext + 12 nonce + 16 tag) plus 28 bytes of body framing (12 nonce + 16 tag), a fixed 104 bytes regardless of message size. For a typical MAVLink telemetry payload of 32 bytes, the overhead is 3.25× the payload, so the message grows from 32 to 136 bytes on the wire (a 4.25× expansion). That cost is accepted for the security properties delivered.
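The byte counts above can be checked with a few lines of Python. The constant and function names are illustrative; the 128-byte limit is the TUNNEL payload field size quoted elsewhere in this article.

```python
HEADER_CIPHERTEXT = 48      # encrypted_header
NONCE = 12                  # one GCM nonce
TAG = 16                    # one GCM authentication tag
TUNNEL_PAYLOAD_MAX = 128    # TUNNEL payload field size

header_overhead = HEADER_CIPHERTEXT + NONCE + TAG   # header ciphertext + nonce + tag
body_overhead = NONCE + TAG                         # body nonce + tag
fixed_overhead = header_overhead + body_overhead

def on_wire_size(plaintext_len: int) -> int:
    """Size of one Swarm message carrying plaintext_len bytes of payload."""
    return fixed_overhead + plaintext_len

payload = 32
overhead_ratio = fixed_overhead / payload          # overhead relative to the payload
expansion = on_wire_size(payload) / payload        # total size relative to the payload
max_single_frame_plaintext = TUNNEL_PAYLOAD_MAX - fixed_overhead

print(fixed_overhead, on_wire_size(payload), overhead_ratio, expansion,
      max_single_frame_plaintext)
```

The last value is the largest plaintext that still fits in one TUNNEL payload field without fragmentation.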

Out-of-order message handling

Drone mesh networks have consistently high packet reorder rates. RF path diversity means two messages sent consecutively may arrive via different relay paths with different latencies; store-and-forward relay nodes introduce additional jitter; nodes moving at 30 m/s create variable-delay links. A protocol that requires in-order delivery is a protocol that will silently drop messages in the field.

The KDF chain is sequential by construction: to decrypt message N, you must advance the chain from its current position up to index N. If message N+5 arrives before message N, the recipient cannot immediately decrypt it, because the chain has not yet reached index N+5. The solution is pre-derivation and caching.

When the recipient receives a message with an index ahead of their current chain position, they advance the chain forward to that index, caching each intermediate message key they derive along the way. The arriving message is decrypted with the key at its index; cached keys for skipped messages are retained in a sliding window cache:

// Out-of-order key cache type
// Key: (session_id: [u8; 16], ratchet_epoch: u32, message_index: u32)
// Value: message_key [u8; 32]
type SkippedKeyCache = HashMap<(SessionId, u32, u32), [u8; 32]>;

const MAX_SKIP_AHEAD: u32 = 100;    // max messages to pre-derive in one step
const MAX_CACHE_AGE: u32 = 200;     // evict keys older than 200 messages ago

fn try_decrypt_message(
    state: &mut RatchetState,
    cache: &mut SkippedKeyCache,
    header: &MessageHeader,
    ciphertext: &[u8],
) -> Result<Vec<u8>, RatchetError> {
    // 1. Check the skipped-key cache first (out-of-order delivery)
    let cache_key = (state.session_id, header.ratchet_epoch, header.message_index);
    if let Some(msg_key) = cache.remove(&cache_key) {
        return aes_256_gcm_decrypt(msg_key, ciphertext);
    }

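    // 2. If the sender has moved to a newer epoch, complete the DH ratchet step.
    //    A message from an older epoch that missed the cache is stale or replayed
    //    and must not trigger a ratchet step with an outdated public key.
    if header.ratchet_epoch < state.current_epoch {
        return Err(RatchetError::StaleEpoch); // hypothetical variant, not in the original listing
    }
    if header.ratchet_epoch > state.current_epoch {
        advance_dh_ratchet(state, &header.ratchet_pk)?;
    }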

    // 3. Pre-derive and cache any skipped keys in this epoch
    let skip_count = header.message_index
        .checked_sub(state.recv_chain_index)
        .ok_or(RatchetError::IndexRegression)?;
    if skip_count > MAX_SKIP_AHEAD {
        return Err(RatchetError::TooManySkipped);
    }
    for idx in state.recv_chain_index..header.message_index {
        let (new_ck, msg_key) = kdf_chain_step(state.recv_chain_key);
        state.recv_chain_key = new_ck;
        cache.insert((state.session_id, state.current_epoch, idx), msg_key);
        // Evict stale cache entries
        evict_old_keys(cache, state.session_id, state.current_epoch, idx, MAX_CACHE_AGE);
    }

    // 4. Derive the key for the target message index
    let (new_ck, msg_key) = kdf_chain_step(state.recv_chain_key);
    state.recv_chain_key = new_ck;
    state.recv_chain_index = header.message_index + 1;
    aes_256_gcm_decrypt(msg_key, ciphertext)
}

The MAX_SKIP_AHEAD limit of 100 messages prevents an adversary from forcing unbounded key pre-derivation by sending a message with a very large future index. The MAX_CACHE_AGE limit of 200 messages triggers eviction of old cached keys to bound memory usage: a node that falls more than 200 messages behind in a given epoch will fail to decrypt those messages, but this is preferable to unbounded cache growth on constrained hardware. On an STM32H7 with 1 MB of SRAM, a full 200-entry cache per session holds 200 × 32 bytes = 6.4 KB of key material; each entry's 24-byte lookup key (16-byte session ID plus two u32 counters) and the hash map's own bookkeeping add to that, but the total remains small.
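The caching behaviour can be exercised end to end in a small Python simulation. This is a simplified sketch under stated assumptions: a single epoch, no age-based eviction, and a cache keyed by message index only. `ReceivingChain` and `hkdf_sha256` are hypothetical names, not the SDK's.

```python
import hashlib
import hmac

def hkdf_sha256(ikm: bytes, salt: bytes, info: bytes, length: int) -> bytes:
    """HKDF (RFC 5869) with SHA-256: extract, then expand."""
    prk = hmac.new(salt or b"\x00" * 32, ikm, hashlib.sha256).digest()
    okm, block, counter = b"", b"", 1
    while len(okm) < length:
        block = hmac.new(prk, block + info + bytes([counter]), hashlib.sha256).digest()
        okm += block
        counter += 1
    return okm[:length]

def kdf_chain_step(chain_key: bytes) -> tuple[bytes, bytes]:
    message_key = hkdf_sha256(chain_key, b"", b"swarm-msg-key-v1", 32)
    new_chain_key = hkdf_sha256(chain_key, b"", b"swarm-chain-key-v1", 32)
    return new_chain_key, message_key

MAX_SKIP_AHEAD = 100

class ReceivingChain:
    """Receiving chain for one epoch with a skipped-key cache."""

    def __init__(self, chain_key: bytes) -> None:
        self.chain_key = chain_key
        self.next_index = 0
        self.skipped: dict[int, bytes] = {}

    def key_for(self, index: int) -> bytes:
        """Return the message key for index, caching keys for skipped indices."""
        if index in self.skipped:
            return self.skipped.pop(index)           # late arrival: key is used once
        if index < self.next_index:
            raise ValueError("index already consumed or evicted")
        if index - self.next_index > MAX_SKIP_AHEAD:
            raise ValueError("too many skipped messages")
        while self.next_index < index:               # pre-derive the gap
            self.chain_key, key = kdf_chain_step(self.chain_key)
            self.skipped[self.next_index] = key
            self.next_index += 1
        self.chain_key, key = kdf_chain_step(self.chain_key)
        self.next_index += 1
        return key

# The sender derives keys 0..5 in order from a shared seed (placeholder value).
seed = bytes(32)
sender_keys, sender_ck = [], seed
for _ in range(6):
    sender_ck, key = kdf_chain_step(sender_ck)
    sender_keys.append(key)

# The receiver sees message 5 first, then 2, then 0.
rx = ReceivingChain(seed)
received = {i: rx.key_for(i) for i in [5, 2, 0]}
```

After this run the cache still holds keys for the messages that have not arrived yet, and a second request for index 5 is rejected because its key has already been consumed.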

MAVLink v2 framing

MAVLink v2 is the standard message framing protocol for PX4 and ArduPilot autopilots. Its packet structure is compact and well-supported by autopilot firmware, ground control stations, and drone SDKs across the industry. The Swarm SDK wraps Double Ratchet encrypted payloads inside MAVLink v2 TUNNEL messages (message_id 385) to remain compatible with existing autopilot firmware without requiring any changes to flight-controller code.

# MAVLink v2 packet structure (on-wire layout)
struct MavlinkV2Packet:
    magic:          uint8   = 0xFD          # MAVLink v2 start-of-frame byte
    payload_len:    uint8                   # payload length (0–255 bytes)
    incompat_flags: uint8   = 0x00
    compat_flags:   uint8   = 0x00
    seq:            uint8                   # per-component sequence counter (wraps at 255)
    sys_id:         uint8                   # source system ID (1–255)
    comp_id:        uint8                   # source component ID
    msg_id:         uint24  = 385           # TUNNEL message ID
    payload:        bytes[payload_len]      # TUNNEL payload (see below)
    crc:            uint16                  # CRC-16/MCRF4XX over header (excluding the magic byte) + payload + CRC extra

# MAVLink TUNNEL message fields (message_id 385), in definition order; on the wire MAVLink sorts fields by type size, so payload_type is serialized first
struct TunnelPayload:
    target_system:    uint8                 # 0 = broadcast
    target_component: uint8                 # 0 = broadcast
    payload_type:     uint16 = 0xAA01       # 0xAA01 = Swarm SDK encrypted frame
    payload_length:   uint8                 # length of actual data in payload field
    payload:          bytes[128]            # Swarm SDK encrypted Double Ratchet frame

# Total TUNNEL overhead: 10 bytes MAVLink v2 header + 5 bytes TUNNEL header fields + 2 bytes CRC = 17 bytes
# Maximum Swarm SDK frame per TUNNEL: 128 bytes (TUNNEL payload field limit)
# For larger frames, multiple TUNNEL messages are sequenced using a fragmentation header

The payload_type value 0xAA01 lies above 32767 (0x8000–0xFFFF), the range the TUNNEL definition leaves for unregistered, locally defined payload types. Autopilot firmware (PX4 and ArduPilot both) passes TUNNEL messages through its routing layer without processing the payload; the autopilot simply forwards them to the appropriate link. This means Swarm SDK encrypted frames are transparently relayed by existing autopilot firmware with no firmware modifications and no risk of the autopilot interfering with encrypted content.
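As a rough illustration of the framing, the Python sketch below packs a TUNNEL payload into an unsigned MAVLink v2 frame. It is a simplified sketch, not the SDK's encoder: it skips MAVLink v2's trailing-zero payload truncation and message signing, the function names are hypothetical, and `CRC_EXTRA_PLACEHOLDER` is a dummy value (the real CRC extra byte for TUNNEL must be taken from the generated MAVLink headers). Fields are serialized widest type first, so payload_type leads the payload.

```python
import struct

TUNNEL_MSG_ID = 385
SWARM_PAYLOAD_TYPE = 0xAA01
CRC_EXTRA_PLACEHOLDER = 0      # NOT the real TUNNEL CRC extra; placeholder only

def crc16_mcrf4xx(data: bytes, crc: int = 0xFFFF) -> int:
    """CRC-16/MCRF4XX, the checksum MAVLink uses (byte-wise form)."""
    for byte in data:
        tmp = (byte ^ crc) & 0xFF
        tmp = (tmp ^ (tmp << 4)) & 0xFF
        crc = ((crc >> 8) ^ (tmp << 8) ^ (tmp << 3) ^ (tmp >> 4)) & 0xFFFF
    return crc

def pack_tunnel_payload(target_system: int, target_component: int,
                        payload_type: int, data: bytes) -> bytes:
    """Serialize TUNNEL fields in wire order, zero-padding data to 128 bytes."""
    if len(data) > 128:
        raise ValueError("TUNNEL payload field holds at most 128 bytes")
    fixed = struct.pack("<HBBB", payload_type, target_system,
                        target_component, len(data))
    return fixed + data.ljust(128, b"\x00")

def pack_v2_frame(seq: int, sys_id: int, comp_id: int, msg_id: int,
                  payload: bytes, crc_extra: int) -> bytes:
    """Build an unsigned MAVLink v2 frame: 10-byte header, payload, 2-byte CRC."""
    if len(payload) > 255:
        raise ValueError("MAVLink v2 payload is at most 255 bytes")
    header = struct.pack("<BBBBBBB", 0xFD, len(payload), 0, 0,
                         seq & 0xFF, sys_id, comp_id)
    header += msg_id.to_bytes(3, "little")
    # The checksum skips the magic byte and appends the per-message CRC extra.
    crc = crc16_mcrf4xx(header[1:] + payload + bytes([crc_extra]))
    return header + payload + struct.pack("<H", crc)

encrypted_frame = bytes(126)   # stand-in for a Swarm SDK encrypted frame
tunnel = pack_tunnel_payload(0, 0, SWARM_PAYLOAD_TYPE, encrypted_frame)
frame = pack_v2_frame(seq=7, sys_id=1, comp_id=1, msg_id=TUNNEL_MSG_ID,
                      payload=tunnel, crc_extra=CRC_EXTRA_PLACEHOLDER)
```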

The 128-byte payload field is the binding constraint. With 104 bytes of fixed Double Ratchet overhead, a single TUNNEL message has room for at most 128 − 104 = 24 bytes of plaintext. A 22-byte plaintext gives 104 + 22 = 126 bytes, just inside the limit; the upper part of the typical 22–36 byte telemetry range does not fit (a 32-byte payload needs 136 bytes). Anything larger, including command messages and sensor data blobs, uses a simple two-byte fragmentation header (fragment_index, total_fragments) that the Swarm SDK reassembles before passing to the ratchet layer. Only the most compact high-frequency MAVLink messages avoid fragmentation.
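A minimal Python sketch of the two-byte fragmentation scheme follows. The article does not say whether unfragmented frames also carry the header, so this sketch assumes every fragment does, leaving 126 bytes of data per TUNNEL payload; `fragment` and `reassemble` are hypothetical names.

```python
TUNNEL_PAYLOAD_MAX = 128
FRAG_HEADER_LEN = 2                                   # fragment_index, total_fragments
CHUNK = TUNNEL_PAYLOAD_MAX - FRAG_HEADER_LEN          # data bytes per fragment

def fragment(frame: bytes) -> list[bytes]:
    """Split an encrypted frame into TUNNEL-sized fragments."""
    chunks = [frame[i:i + CHUNK] for i in range(0, len(frame), CHUNK)] or [b""]
    if len(chunks) > 255:
        raise ValueError("frame needs more fragments than a one-byte count allows")
    return [bytes([index, len(chunks)]) + chunk
            for index, chunk in enumerate(chunks)]

def reassemble(fragments: list[bytes]) -> bytes:
    """Rebuild the frame from fragments received in any order."""
    total = fragments[0][1]
    by_index = {f[0]: f[2:] for f in fragments}
    if len(fragments) != total or set(by_index) != set(range(total)):
        raise ValueError("missing or duplicate fragment")
    return b"".join(by_index[i] for i in range(total))

# A 136-byte frame (104 bytes of overhead plus a 32-byte plaintext).
frame = bytes(range(136))
parts = fragment(frame)
restored = reassemble(list(reversed(parts)))          # arrival order reversed
```

Reassembly happens before decryption, so the ratchet layer only ever sees complete frames.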

Performance on embedded ARM

The Swarm SDK targets Cortex-M7 and above for flight-controller companion computers, with the Jetson Nano as the representative high-end embedded platform. All benchmarks were run with compiler optimizations enabled (Rust release mode with codegen-units=1 and lto=thin), without hardware cryptography acceleration (the STM32H7's AES hardware accelerator is available but was not used in these numbers to establish a conservative baseline).

Benchmark results — Double Ratchet operations
Platform: STM32H7 (Cortex-M7, 480 MHz) and Jetson Nano (ARM Cortex-A57, 1.43 GHz)

Operation                              STM32H7      Jetson Nano
──────────────────────────────────────────────────────────────
X25519 DH ratchet step                 1.2 ms        0.3 ms
HKDF-SHA-256 chain advance             0.08 ms       0.02 ms
AES-256-GCM encrypt (256-byte payload) 0.15 ms       0.04 ms
Header encrypt (AES-256-GCM, 48 bytes) 0.08 ms       0.02 ms
Full message encrypt                   1.8 ms        0.4 ms
  (= X25519 step + HKDF + AES-GCM body + AES-GCM header)

ML-KEM-768 encapsulation (initial KE)  12 ms         2.1 ms
ML-KEM-768 decapsulation               11 ms         1.9 ms

Out-of-order key cache lookup          0.003 ms      0.001 ms
Out-of-order key pre-derive (10 keys)  0.9 ms        0.2 ms

At a MAVLink update rate of 100 Hz, a message is sent every 10 ms. The 1.8 ms full-message figure on the STM32H7 includes the X25519 ratchet step, so it is the worst case: 18% of the 10 ms budget for a naive single-threaded implementation. Messages that do not trigger a ratchet step skip the 1.2 ms X25519 cost; going by the component rows in the table, they need roughly 0.3 ms (chain advance plus body and header encryption). In practice the Swarm SDK runs the ratchet encrypt on a dedicated low-priority thread and double-buffers outgoing messages, so the compute does not appear on the flight-critical scheduling path. The effective overhead on the flight-controller CPU is under 0.2% of the available compute budget when measured end-to-end across a full mission profile.

The DH ratchet step does not trigger on every message — the SDK advances the X25519 ratchet every 1,000 messages or 60 seconds, whichever comes first. At 100 Hz, this means one ratchet step approximately every 10 seconds. The 1.2 ms ratchet step cost is absorbed into a single message's latency budget and has no effect on throughput.

ML-KEM-768 encapsulation takes 12 ms on the H7 and is performed exactly once per session establishment (not on every DH ratchet step). A mission may establish a handful of sessions at startup. The 12 ms overhead is entirely acceptable for an operation that happens at most a few dozen times per mission. With the STM32H7's hardware AES accelerator enabled, AES-256-GCM drops from 0.15 ms to approximately 0.04 ms, a 3.75× improvement. Because the X25519 step dominates the 1.8 ms worst case, this saves only about 0.11 ms per message on body encryption, with a further small saving on the 48-byte header.

Integration with the mesh routing layer

Each pair of nodes in the mesh maintains an independent Double Ratchet session. In a 10-node mesh, each node maintains up to 9 active sessions, one per peer. Within a session the two directions are independent: the Alice→Bob chain and the Bob→Alice chain use separate chain keys and message indices. This is standard Double Ratchet behavior; it ensures that Alice's message key for outbound message 42 is cryptographically unrelated to Bob's message key for his outbound message 42.

The gossip mesh router operates entirely on encrypted payloads. It sees TUNNEL message source and destination fields (system_id, component_id) but not plaintext. Session identifiers embedded in the encrypted header are invisible to the router. This property — the router cannot read the content it forwards — is deliberate. It allows nodes from different organizations to operate in the same mesh without requiring the routing layer to be trusted with plaintext.

Session state — root key, current X25519 ratchet key pair, sending and receiving chain keys, epoch counter, message index, and the skipped-key cache — is stored in the STM32H7's DTCM SRAM, which is accessible only to the privileged CPU and is not DMA-reachable from the radio subsystem. A hardware zeroization watchdog zeroes this region on power-off, hard fault, and watchdog reset. A drone capture that interrupts normal power-down before the watchdog fires will find partially valid session state; this is a known residual risk documented in the threat model.

The DH ratchet renewal policy: a new X25519 ratchet step is triggered on the first message after either 1,000 messages have been sent in the current epoch or 60 seconds have elapsed since the last ratchet step. Both conditions are checked on each outgoing message. The 60-second time bound ensures that even in low-traffic conditions (a node hovering in observation mode sending one heartbeat per second), the ratchet still advances and the post-compromise healing window does not grow unboundedly.
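The renewal policy reduces to a two-condition check on each outgoing message. The Python sketch below uses hypothetical names (`EpochCounters`, `should_ratchet`, `seconds_until_ratchet`); only the 1,000-message and 60-second thresholds come from the text.

```python
from dataclasses import dataclass

MAX_MESSAGES_PER_EPOCH = 1_000
MAX_EPOCH_SECONDS = 60.0

@dataclass
class EpochCounters:
    """Per-session counters for the current DH ratchet epoch."""
    messages_sent: int = 0
    started_at: float = 0.0     # timestamp of the last ratchet step, in seconds

def should_ratchet(epoch: EpochCounters, now: float) -> bool:
    """True when the next outgoing message should carry a new ratchet key."""
    return (epoch.messages_sent >= MAX_MESSAGES_PER_EPOCH
            or now - epoch.started_at >= MAX_EPOCH_SECONDS)

def seconds_until_ratchet(rate_hz: float) -> float:
    """Time between ratchet steps at a steady message rate."""
    return min(MAX_MESSAGES_PER_EPOCH / rate_hz, MAX_EPOCH_SECONDS)

# At 100 Hz the message count triggers first; at 1 Hz the time bound does.
busy_interval = seconds_until_ratchet(100.0)
idle_interval = seconds_until_ratchet(1.0)
```

At a steady 100 Hz the message-count limit is reached first; at one heartbeat per second the 60-second bound is what forces the step.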

For the full session establishment flow — including the ML-KEM-768 + X25519 hybrid initial key exchange, out-of-band public key distribution, and DID-based peer authentication — see the post-quantum drone mesh architecture article. The Double Ratchet described here operates on top of the established session; the two layers are cleanly separated by interface.
