Defense · 2026

CryptoGen: Secure Transformer Generation with Encrypted KV-Cache Reuse

Hedong Zhang, Neusha Javidnia, Shweta Pardeshi, Qian Lou, Farinaz Koushanfar

0 citations · 58 references · arXiv (Cornell University)


Published on arXiv · 2602.08798

Model Theft

OWASP ML Top 10 — ML05

Key Finding

Achieves 4.4×–7.6× lower per-token latency than state-of-the-art discriminative secure-inference baselines for input lengths of 128–512 tokens, while maintaining near-linear scaling.

CryptoGen

Novel technique introduced


Abstract

The widespread deployment of cloud-hosted generative models raises a fundamental challenge: enabling efficient autoregressive generation while preserving the privacy of both user prompts and model parameters in untrusted environments. We address this challenge in a client-server setting where an untrusted server hosts an autoregressive Transformer and the client requires cryptographic protection for both inputs and inference. We present CryptoGen, the first system to enable scalable privacy-preserving neural generation with persistent encrypted key-value (KV) cache reuse. Discriminative-task secure inference systems incur quadratic latency and memory growth when adapted to autoregressive decoding due to the lack of native encrypted KV-cache support. In contrast, CryptoGen achieves near-linear scaling by securely reusing and updating encrypted KV caches throughout generation. CryptoGen integrates homomorphic encryption and secret sharing to support both prefilling and generation. Key techniques include a unified encrypted KV-cache framework, heterogeneous SIMD encodings for different phases, optimized cipher-cipher matrix-matrix and matrix-vector operations, and efficient noise refresh and ciphertext concatenation mechanisms. Evaluation on generative Transformer models trained on WikiText-2, PTB, and LAMBADA shows that for input lengths of 128–512 tokens, CryptoGen achieves 4.4×–7.6× lower per-token latency than state-of-the-art discriminative secure inference systems, while maintaining near-linear latency and memory scaling, with advantages increasing for longer sequences. CryptoGen is released as an open-source library.
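The quadratic-vs-linear scaling claim can be illustrated with a plaintext cost model, independent of the cryptography: without a reusable KV cache, every decoding step must re-run secure inference over the entire prefix, so total work grows quadratically in output length; with a persistent cache, the prefix is processed once (prefill) and each step adds only one new encrypted K/V entry. The sketch below is a toy illustration, not CryptoGen's API (the functions and cost units are hypothetical):

```python
# Toy cost model: one unit of work = securely processing one token
# position through the model in a single forward pass.

def total_work_no_cache(prompt_len: int, gen_len: int) -> int:
    # Without KV-cache reuse, decoding step t re-processes the whole
    # sequence of length prompt_len + t -> quadratic total cost.
    return sum(prompt_len + t for t in range(1, gen_len + 1))

def total_work_with_cache(prompt_len: int, gen_len: int) -> int:
    # With a persistent encrypted KV cache, the prompt is processed
    # once (prefill) and each decoding step adds one K/V entry ->
    # near-linear total cost.
    return prompt_len + gen_len
```

For a 128-token prompt and 4 generated tokens, the no-cache model pays 129 + 130 + 131 + 132 = 522 units versus 132 with the cache, and the gap widens with sequence length, consistent with the paper's observation that CryptoGen's advantage grows for longer sequences.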


Key Contributions

  • First system supporting scalable encrypted KV-cache reuse for autoregressive Transformer generation, achieving near-linear (vs. quadratic) latency and memory scaling
  • Unified HE+MPC framework combining heterogeneous SIMD encodings for prefilling and generation phases with optimized cipher-cipher matrix operations
  • Novel noise refresh and ciphertext concatenation mechanisms enabling persistent encrypted KV-cache updates across decoding steps
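The interplay of the last two contributions can be sketched as a simple data structure: ciphertext concatenation extends the encrypted cache in place rather than re-encrypting the history, and noise refresh is triggered only when a ciphertext's accumulated noise approaches its budget. Everything below is a hypothetical illustration (the `Ciphertext` class, `refresh`, and the integer noise counter are stand-ins; real HE packing and the interactive refresh protocol are far more involved):

```python
from dataclasses import dataclass, field

NOISE_BUDGET = 8  # hypothetical threshold before a refresh is needed

@dataclass
class Ciphertext:
    payload: object
    noise: int = 0  # toy stand-in for accumulated HE noise

def refresh(ct: Ciphertext) -> Ciphertext:
    # Stand-in for a client-assisted noise-refresh round that
    # returns a fresh encryption of the same plaintext.
    return Ciphertext(ct.payload, noise=0)

@dataclass
class EncryptedKVCache:
    keys: list = field(default_factory=list)
    values: list = field(default_factory=list)

    def append_step(self, ct_k: Ciphertext, ct_v: Ciphertext) -> None:
        # Ciphertext "concatenation": extend the cache with this
        # step's encrypted K/V instead of re-encrypting the history.
        self.keys.append(ct_k)
        self.values.append(ct_v)
        # Refresh only entries whose noise exceeds the budget.
        self.keys = [refresh(c) if c.noise > NOISE_BUDGET else c
                     for c in self.keys]
        self.values = [refresh(c) if c.noise > NOISE_BUDGET else c
                       for c in self.values]
```

The design point this toy captures is that the cache persists across decoding steps, so each step's cost is dominated by the one new entry plus occasional refreshes, rather than by rebuilding the encrypted state from scratch.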

🛡️ Threat Analysis

Model Theft

CryptoGen explicitly defends against a semi-honest client adversary who could otherwise extract model weights or parameters through query access: homomorphic encryption keeps the model parameters encrypted throughout inference, directly countering model IP theft. This places the work under OWASP ML Top 10 ML05 (Model Theft), which applies to HE/MPC inference whenever the threat model includes an adversary attempting to extract model IP.


Details

Domains
nlp
Model Types
transformer, LLM
Threat Tags
inference_time, black_box
Datasets
WikiText-2, PTB, LAMBADA
Applications
language model inference, privacy-preserving ML inference, cloud-hosted generative AI