Privacy-Aware Split Inference with Speculative Decoding for Large Language Models over Wide-Area Networks

We present a practical system for privacy-aware large language model (LLM) inference that splits a transformer between a trusted local GPU and an untrusted cloud GPU, communicating only intermediate activations over the network. Our system addresses the unique challenges of autoregressive LLM decoding over high-latency wide-area networks (WANs), contributing: (1) an asymmetric layer split where embedding and unembedding layers remain local, ensuring raw tokens never leave the trusted device; (2) the first application of lookahead decoding to split inference over WANs, amortizing network round-trip latency across multiple tokens per iteration; (3) an empirical inversion attack evaluation showing that split depth provides a tunable privacy-performance tradeoff: an attacker can recover ~59% of tokens at a 2-layer split but only ~35% at an 8-layer split, with minimal throughput impact; (4) ablation experiments showing that n-gram speculation accepts 1.2-1.3 tokens per decoding step on average (peak of 7 observed on code), with acceptance rates consistent across model scales; (5) formal verification that lookahead decoding produces token-identical output to sequential decoding under greedy argmax, with zero quality degradation; and (6) scaling validation on Mistral NeMo 12B (40 layers), demonstrating that the system generalizes to larger models with only 4.9 GB local VRAM and matching 7B throughput. Evaluated on Mistral 7B and NeMo 12B over a ~80ms WAN link, our system achieves 8.7-9.3 tok/s (7B) and 7.8-8.7 tok/s (12B) with lookahead decoding, with an RTT decomposition model (validated at <6.2% cross-validation error) projecting 15-19 tok/s at 20ms RTT.
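To make the RTT projection in the abstract easier to follow, here is a minimal back-of-the-envelope sketch. It is not the paper's RTT decomposition model, whose exact form is not given here. It assumes each decoding step costs one network round trip plus a fixed compute time, and that throughput is the number of accepted tokens per step divided by that step time. The function names and the midpoint values (1.25 tokens per step, 9.0 tok/s) are illustrative assumptions taken from the reported ranges.

```python
def tokens_per_second(tokens_per_step: float, compute_s: float, rtt_s: float) -> float:
    """Throughput when each decoding step costs one round trip plus compute."""
    return tokens_per_step / (compute_s + rtt_s)


def implied_compute_s(tokens_per_step: float, observed_tok_s: float, rtt_s: float) -> float:
    """Invert the model: solve for per-step compute time from a measured throughput."""
    return tokens_per_step / observed_tok_s - rtt_s


if __name__ == "__main__":
    accepted = 1.25  # assumed midpoint of the reported 1.2-1.3 tokens per step
    observed = 9.0   # assumed midpoint of the reported 8.7-9.3 tok/s at ~80 ms RTT
    compute = implied_compute_s(accepted, observed, rtt_s=0.080)
    for rtt_ms in (80, 40, 20):
        rate = tokens_per_second(accepted, compute, rtt_ms / 1000)
        print(f"RTT {rtt_ms} ms: {rate:.1f} tok/s")
```

Under these assumed midpoints, the simplified model gives roughly 16 tok/s at 20 ms RTT, which is within the reported 15-19 tok/s projection. The real decomposition presumably separates more terms (local compute, cloud compute, serialization).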

Key Contributions

Asymmetric split inference architecture keeping embedding/unembedding layers local, ensuring raw user tokens never reach the untrusted cloud GPU even if intermediate activations are observed
First integration of lookahead decoding with split inference to amortize WAN round-trip latency, achieving 8.7–9.3 tok/s on Mistral 7B over an ~80ms WAN link
Empirical inversion attack evaluation quantifying token recoverability across split depths (59% at 2-layer vs. 35% at 8-layer), providing a principled privacy-performance tradeoff for deployment decisions
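The idea of amortizing one round trip over several tokens, while keeping the output identical to sequential greedy decoding, can be sketched with a toy example. This is an illustration under stated assumptions, not the system's implementation. `toy_next_token` is a hypothetical stand-in for a greedy argmax model, `ngram_draft` is a simple n-gram lookup over the generated history, and each loop iteration in `speculative_decode` stands in for one network round trip in which all draft positions would be verified in a single forward pass.

```python
from typing import Callable, List, Sequence, Tuple

NextToken = Callable[[Sequence[int]], int]


def toy_next_token(prefix: Sequence[int]) -> int:
    """Hypothetical deterministic 'model': greedy next token from the last token."""
    return (prefix[-1] * 3 + 1) % 7


def sequential_decode(next_token: NextToken, prompt: List[int], n_new: int) -> List[int]:
    """Baseline: one token per step."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(next_token(out))
    return out


def ngram_draft(history: Sequence[int], k: int = 2, max_draft: int = 4) -> List[int]:
    """Propose tokens that followed the most recent earlier occurrence of the last k tokens."""
    if len(history) <= k:
        return []
    key = list(history[-k:])
    for start in range(len(history) - k - 1, -1, -1):
        if list(history[start:start + k]) == key:
            return list(history[start + k:start + k + max_draft])
    return []


def speculative_decode(next_token: NextToken, prompt: List[int], n_new: int,
                       k: int = 2, max_draft: int = 4) -> Tuple[List[int], int]:
    """Greedy verification: keep draft tokens only while they match the model's own choice."""
    out = list(prompt)
    target_len = len(prompt) + n_new
    steps = 0
    while len(out) < target_len:
        draft = ngram_draft(out, k, max_draft)
        accepted: List[int] = []
        # A real system would score all draft positions in one batched pass;
        # here that is simulated by querying the toy model per position.
        for guess in draft:
            truth = next_token(out + accepted)
            if guess != truth:
                accepted.append(truth)  # replace the first wrong guess and stop
                break
            accepted.append(guess)
        else:
            # Draft was empty or fully correct: the step still yields one more token.
            accepted.append(next_token(out + accepted))
        out.extend(accepted)
        steps += 1
    return out[:target_len], steps


if __name__ == "__main__":
    baseline = sequential_decode(toy_next_token, [1, 2], 30)
    fast, steps = speculative_decode(toy_next_token, [1, 2], 30)
    print(fast == baseline, steps)
```

Every emitted token is the model's own greedy choice for its prefix, so the output matches sequential decoding by construction. Only the number of steps, and hence round trips, changes.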

🛡️ Threat Analysis

Model Inversion Attack

The core threat is an adversary (the untrusted cloud provider) inverting intermediate transformer activations to recover private user input tokens, which is directly analogous to the embedding inversion covered under ML03. The paper empirically evaluates this attack at different split depths and proposes an architectural defense (keeping the embedding/unembedding layers local and adjusting the split depth), validated against that attack.
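As a rough illustration of how token recoverability can be scored, the sketch below runs a nearest-neighbor inversion against a toy embedding table and reports the fraction of tokens recovered. It is not the attack evaluated in the paper. The Gaussian noise is a crude, hypothetical stand-in for the way deeper layers move activations away from the raw embeddings, and all names and sizes are assumptions.

```python
import random
from typing import List, Sequence


def nearest_token(activation: Sequence[float], table: Sequence[Sequence[float]]) -> int:
    """Return the index of the embedding row closest (squared L2) to the activation."""
    def sq_dist(row: Sequence[float]) -> float:
        return sum((a - b) ** 2 for a, b in zip(activation, row))
    return min(range(len(table)), key=lambda i: sq_dist(table[i]))


def recovery_rate(true_tokens: Sequence[int], guessed_tokens: Sequence[int]) -> float:
    """Fraction of positions where the attacker's guess equals the true token."""
    if len(true_tokens) != len(guessed_tokens):
        raise ValueError("sequences must have equal length")
    if not true_tokens:
        return 0.0
    hits = sum(t == g for t, g in zip(true_tokens, guessed_tokens))
    return hits / len(true_tokens)


def simulate(noise: float, seed: int = 0, vocab: int = 50,
             dim: int = 8, length: int = 200) -> float:
    """Toy experiment: the attacker sees embedding + noise and guesses the nearest row."""
    rng = random.Random(seed)
    table: List[List[float]] = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(vocab)]
    tokens = [rng.randrange(vocab) for _ in range(length)]
    observed = [[x + rng.gauss(0, noise) for x in table[t]] for t in tokens]
    guesses = [nearest_token(act, table) for act in observed]
    return recovery_rate(tokens, guesses)


if __name__ == "__main__":
    for noise in (0.0, 0.5, 1.0, 2.0):
        print(f"noise={noise}: recovery={simulate(noise):.2f}")
```

With no perturbation the toy attacker recovers every token, and recovery falls as the observed activations drift from the embedding table. The same metric, the fraction of tokens recovered, is what the reported 59% and 35% figures measure.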

Details

Domains

nlp

Model Types

llm, transformer

Threat Tags

grey_box, inference_time

Datasets

Mistral 7B, Mistral NeMo 12B

Applications

2025 1 cit.

Model Inversion Attack

86%