Defense · 2025

Selective KV-Cache Sharing to Mitigate Timing Side-Channels in LLM Inference

Kexin Chu 1, Zecheng Lin 2,1, Dawei Xiang 1, Zixu Shen 1, Jianchang Su 1, Cheng Chu 3, Yiwei Yang 4, Wenhui Zhang 5, Wenfei Wu 5, Wei Zhang 1


Published on arXiv: 2508.08438

Sensitive Information Disclosure

OWASP LLM Top 10 — LLM06

Key Finding

SafeKV reduces time-to-first-token overhead vs. full isolation by up to 40.58% and raises throughput by up to 2.66x while enforcing cross-tenant privacy in multi-tenant LLM serving.

SafeKV

Novel technique introduced


Global KV-cache sharing is an effective optimization for accelerating large language model (LLM) inference, yet it introduces an API-visible timing side channel that lets adversaries infer sensitive user inputs from shared entries, leading to cross-tenant privacy risks. To address this problem, we introduce SafeKV (Secure and Flexible KV-cache Sharing), a system-level co-design of privacy enforcement and KV-cache management. SafeKV integrates lightweight detection and isolation directly into the serving runtime to eliminate cross-tenant reuse of sensitive KV-cache blocks under our threat model, while recovering most of the performance benefits of global sharing. Our key contributions are: (1) a three-tier asynchronous detection pipeline that decouples privacy classification from inference and supports streaming workloads, (2) a unified radix-tree-based memory manager with path compression and sensitivity-aware eviction for scalable selective isolation, and (3) an RDR-guided (Reuse Diversity Ratio) runtime safeguard that detects and bounds residual leakage. On large LLM backends, SafeKV reduces the time-to-first-token (TTFT) overhead compared to full isolation by up to 40.58% and raises throughput by up to 2.66x. Overall, SafeKV restores the efficiency of KV reuse while enforcing strong, practical privacy for multi-tenant LLM inference.
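The timing side channel described above can be made concrete with a toy model (an illustrative sketch, not code from the paper): when KV-cache entries are shared globally across tenants, a cache hit yields a much lower time-to-first-token than a cold prefill, so an attacker who replays a guessed prefix can tell from latency alone whether another tenant already submitted it. The class and latency constants below are hypothetical.

```python
# Toy illustration of the cross-tenant timing side channel that SafeKV
# mitigates. A globally shared prefix cache makes TTFT measurably lower
# when ANY tenant has already prefilled the same prompt prefix.

class SharedPrefixCache:
    """Hypothetical KV-cache keyed by prompt prefix, shared across tenants."""

    def __init__(self, prefill_cost=0.050, cached_cost=0.005):
        self.entries = set()
        self.prefill_cost = prefill_cost  # simulated cold-prefill latency (s)
        self.cached_cost = cached_cost    # simulated cache-hit latency (s)

    def time_to_first_token(self, prefix):
        # Return the simulated TTFT and insert the prefix into the cache.
        hit = prefix in self.entries
        self.entries.add(prefix)
        return self.cached_cost if hit else self.prefill_cost


cache = SharedPrefixCache()
# A victim tenant submits a sensitive prompt, populating the shared cache.
cache.time_to_first_token("patient record: John Doe, diagnosis ...")
# An attacker probes the same guessed prefix and observes a far lower TTFT,
# learning that some other tenant already sent it.
probe = cache.time_to_first_token("patient record: John Doe, diagnosis ...")
fresh = cache.time_to_first_token("an unrelated, never-seen prefix")
assert probe < fresh  # the timing gap leaks cross-tenant cache state
```

SafeKV's approach, per the abstract, is to keep this sharing for non-sensitive entries while detecting sensitive blocks and excluding them from cross-tenant reuse, rather than isolating every tenant's cache wholesale.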


Key Contributions

  • A three-tier asynchronous detection pipeline that decouples privacy classification from inference and supports streaming workloads
  • A unified radix-tree-based memory manager with path compression and sensitivity-aware eviction for scalable selective KV-cache isolation
  • An RDR-guided (Reuse Diversity Ratio) runtime safeguard that detects and bounds residual leakage, reducing TTFT overhead by up to 40.58% vs. full isolation while raising throughput by up to 2.66x
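The second contribution, selective isolation in a shared radix tree, can be sketched as follows. This is a hedged toy model under assumed semantics (the function names, the per-tenant keying scheme, and the omission of path compression and eviction are all illustrative, not SafeKV's actual design): blocks classified sensitive are keyed per-tenant so they can never produce a cross-tenant cache hit, while non-sensitive blocks remain globally shareable in the same prefix tree.

```python
# Hedged sketch of selective KV-cache isolation in one prefix tree.
# Sensitive blocks get a tenant-private key; public blocks share one key.
# Names and structure are illustrative assumptions, not SafeKV's API.

class KVNode:
    def __init__(self):
        self.children = {}  # (token, tenant_or_None) -> KVNode


def insert(root, tokens, tenant, sensitive):
    """Cache a token sequence, privately if flagged sensitive."""
    node = root
    for tok in tokens:
        # Tenant-private key for sensitive blocks; shared key otherwise.
        key = (tok, tenant) if sensitive else (tok, None)
        node = node.children.setdefault(key, KVNode())
    return node


def reusable_prefix_len(root, tokens, tenant):
    """How many leading tokens this tenant can reuse from the cache."""
    node, n = root, 0
    for tok in tokens:
        child = node.children.get((tok, None)) or node.children.get((tok, tenant))
        if child is None:
            break
        node, n = child, n + 1
    return n


root = KVNode()
insert(root, ["ssn", "123"], tenant="A", sensitive=True)
insert(root, ["hello", "world"], tenant="A", sensitive=False)
# Tenant B reuses the public prefix but sees no trace of A's sensitive one,
# so probing it yields a cold-path timing, closing the side channel.
assert reusable_prefix_len(root, ["hello", "world"], "B") == 2
assert reusable_prefix_len(root, ["ssn", "123"], "B") == 0
assert reusable_prefix_len(root, ["ssn", "123"], "A") == 2
```

The design intuition matches the key finding above: only the (typically small) sensitive fraction of blocks pays the isolation cost, which is why SafeKV recovers most of the throughput and TTFT benefit of global sharing relative to isolating every tenant's cache.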

🛡️ Threat Analysis


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
black_box, inference_time
Applications
multi-tenant llm inference serving, llm api services