defense 2026

Privacy-Aware Split Inference with Speculative Decoding for Large Language Models over Wide-Area Networks

Michael Cunningham

0 citations · 37 references · arXiv (Cornell University)

Published on arXiv

2602.16760

Model Inversion Attack

OWASP ML Top 10 — ML03

Sensitive Information Disclosure

OWASP LLM Top 10 — LLM06

Key Finding

Increasing the number of local layers from 2 to 8 reduces inversion-attack token recovery from ~59% to ~35% with minimal throughput impact: the system sustains 8.7–9.3 tok/s on Mistral 7B over an ~80ms WAN link.


We present a practical system for privacy-aware large language model (LLM) inference that splits a transformer between a trusted local GPU and an untrusted cloud GPU, communicating only intermediate activations over the network. Our system addresses the unique challenges of autoregressive LLM decoding over high-latency wide-area networks (WANs), contributing: (1) an asymmetric layer split where embedding and unembedding layers remain local, ensuring raw tokens never leave the trusted device; (2) the first application of lookahead decoding to split inference over WANs, amortizing network round-trip latency across multiple tokens per iteration; (3) an empirical inversion attack evaluation showing that split depth provides a tunable privacy-performance tradeoff: an attacker can recover ~59% of tokens at a 2-layer split but only ~35% at an 8-layer split, with minimal throughput impact; (4) ablation experiments showing that n-gram speculation accepts 1.2-1.3 tokens per decoding step on average (peak of 7 observed on code), with acceptance rates consistent across model scales; (5) formal verification that lookahead decoding produces token-identical output to sequential decoding under greedy argmax, with zero quality degradation; and (6) scaling validation on Mistral NeMo 12B (40 layers), demonstrating that the system generalizes to larger models with only 4.9 GB local VRAM and matching 7B throughput. Evaluated on Mistral 7B and NeMo 12B over a ~80ms WAN link, our system achieves 8.7-9.3 tok/s (7B) and 7.8-8.7 tok/s (12B) with lookahead decoding, with an RTT decomposition model (validated at <6.2% cross-validation error) projecting 15-19 tok/s at 20ms RTT.
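The relationship between round-trip time, accepted tokens per step, and throughput can be illustrated with back-of-the-envelope arithmetic. The sketch below is not the paper's RTT decomposition model, whose form is not given here. It assumes a much simpler one in which every decoding step costs exactly one network round trip plus a fixed amount of non-network time. The function names and the midpoint inputs (9.0 tok/s and 1.25 accepted tokens per step, taken from the middle of the reported 7B ranges) are illustrative choices.

```python
def tokens_per_second(rtt_s, compute_s, accepted_per_step):
    """Throughput under the assumed model: one round trip plus compute per step."""
    return accepted_per_step / (rtt_s + compute_s)


def implied_compute_s(rtt_s, tok_per_s, accepted_per_step):
    """Back out the non-network time per step from a measured throughput."""
    return accepted_per_step / tok_per_s - rtt_s


# Assumed midpoints of the ranges reported for Mistral 7B at ~80 ms RTT.
ACCEPTED_PER_STEP = 1.25
MEASURED_TOK_S = 9.0

compute_s = implied_compute_s(0.080, MEASURED_TOK_S, ACCEPTED_PER_STEP)
projected_20ms = tokens_per_second(0.020, compute_s, ACCEPTED_PER_STEP)

print(f"implied non-network time per step: {compute_s * 1000:.1f} ms")
print(f"projected throughput at 20 ms RTT: {projected_20ms:.1f} tok/s")
```

Under these assumptions the implied non-network time is about 59 ms per step, and the projection at 20 ms RTT is roughly 15.8 tok/s, which falls inside the 15-19 tok/s range quoted above. The same formula shows why accepting more than one token per step matters: the round trip is paid once per step, not once per token.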


Key Contributions

  • Asymmetric split inference architecture keeping embedding/unembedding layers local, ensuring raw user tokens never reach the untrusted cloud GPU even if intermediate activations are observed
  • First integration of lookahead decoding with split inference to amortize WAN round-trip latency, achieving 8.7–9.3 tok/s on Mistral 7B over an ~80ms WAN link
  • Empirical inversion attack evaluation quantifying token recoverability across split depths (59% at 2-layer vs. 35% at 8-layer), providing a principled privacy-performance tradeoff for deployment decisions
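The claim that speculation amortizes round trips without changing the output can be illustrated with a toy draft-and-verify loop. This is a minimal sketch under stated assumptions, not the paper's implementation: `greedy_next` is a hypothetical stand-in for the model's greedy argmax, and the draft comes from a simple "last follower" n-gram table, not from lookahead decoding's own n-gram generation. Each loop iteration stands for one verification pass, which in a split deployment would correspond to one network round trip.

```python
def greedy_next(context):
    """Hypothetical stand-in for a model's greedy argmax (deterministic toy rule)."""
    last = context[-1]
    prev = context[-2] if len(context) > 1 else 0
    return (3 * last + prev + 1) % 5


def sequential_decode(prompt, n_new):
    """Baseline: one token per step."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(greedy_next(out))
    return out[len(prompt):]


def speculative_decode(prompt, n_new, max_draft=4):
    """Draft tokens from an n-gram table, then keep only what greedy decoding confirms."""
    out = list(prompt)
    follower = {a: b for a, b in zip(out, out[1:])}  # token -> token that last followed it
    target_len = len(prompt) + n_new
    steps = 0
    while len(out) < target_len:
        steps += 1
        # Draft: chain through the table starting from the last emitted token.
        draft, tok = [], out[-1]
        while len(draft) < max_draft and tok in follower:
            tok = follower[tok]
            draft.append(tok)
        # Verify: compare each guess with the true greedy token for that position.
        # The trailing None yields one extra token when the whole draft matches.
        accepted = []
        for guess in draft + [None]:
            true_tok = greedy_next(out + accepted)
            accepted.append(true_tok)
            if guess != true_tok:
                break
        for tok in accepted:
            if len(out) < target_len:
                follower[out[-1]] = tok
                out.append(tok)
    return out[len(prompt):], steps


tokens, steps = speculative_decode([1, 2], 30)
print(tokens == sequential_decode([1, 2], 30), steps)
```

Every emitted token is the greedy token for its true prefix, so the output matches sequential decoding by construction. A wrong guess only costs the unused part of the draft. The step count can never exceed the number of generated tokens, because each verification pass emits at least one token.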

🛡️ Threat Analysis

Model Inversion Attack

The core threat is an adversary (the untrusted cloud provider) inverting intermediate transformer activations to recover private user input tokens, which is directly analogous to the embedding inversion covered under ML03. The paper empirically evaluates this attack at different split depths and proposes an architectural defense (keeping embedding/unembedding layers local and adjusting split depth) that it validates against the attack.
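The architectural defense can be pictured as a placement plan for the model's stages. The sketch below is illustrative only. It assumes the local transformer blocks are the first `local_layers` blocks and that all remaining blocks run in the cloud; the summary does not specify the exact arrangement. The names `plan_split` and `cloud_visible_payloads` are hypothetical.

```python
def plan_split(num_layers, local_layers):
    """Return (device, stage) pairs in execution order for one forward pass."""
    if not 0 < local_layers < num_layers:
        raise ValueError("local_layers must be between 1 and num_layers - 1")
    plan = [("local", "embed")]  # token IDs are consumed here, on the trusted device
    plan += [("local", f"block_{i}") for i in range(local_layers)]
    plan += [("cloud", f"block_{i}") for i in range(local_layers, num_layers)]
    plan.append(("local", "unembed"))  # logits and sampled tokens stay local
    return plan


def cloud_visible_payloads(plan):
    """Name what crosses the network: the output of each stage just before a device change."""
    payloads = []
    for (dev_a, stage_a), (dev_b, _) in zip(plan, plan[1:]):
        if dev_a != dev_b:
            payloads.append(f"activations after {stage_a}")
    return payloads


# 40 blocks matches the layer count quoted for Mistral NeMo 12B.
plan = plan_split(num_layers=40, local_layers=8)
print(cloud_visible_payloads(plan))
```

In this layout the cloud only ever sees hidden states, never token IDs or logits. Raising `local_layers` moves the outbound activations deeper into the network, which is the knob the inversion evaluation varies between 2 and 8.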


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
grey_box, inference_time
Datasets
Mistral 7B, Mistral NeMo 12B
Applications
llm inference, enterprise ai deployment, privacy-sensitive cloud offloading