Selective KV-Cache Sharing to Mitigate Timing Side-Channels in LLM Inference
Kexin Chu, Zecheng Lin, Dawei Xiang, Zixu Shen, Jianchang Su, Cheng Chu, Yiwei Yang, Wenhui Zhang, Wenfei Wu, Wei Zhang
Published on arXiv (2508.08438)
Sensitive Information Disclosure
OWASP LLM Top 10 — LLM06
Key Finding
SafeKV reduces time-to-first-token overhead vs. full isolation by up to 40.58% and raises throughput by up to 2.66x while enforcing cross-tenant privacy in multi-tenant LLM serving.
SafeKV
Novel technique introduced
Global KV-cache sharing is an effective optimization for accelerating large language model (LLM) inference, yet it introduces an API-visible timing side channel that lets adversaries infer sensitive user inputs from shared entries, leading to cross-tenant privacy risks. To address this problem, we introduce SafeKV (Secure and Flexible KV-cache Sharing), a system-level co-design of privacy enforcement and KV-cache management. SafeKV integrates lightweight detection and isolation directly into the serving runtime to eliminate cross-tenant reuse of sensitive KV-cache blocks under our threat model, while recovering most of the performance benefits of global sharing. Our key contributions are: (1) a three-tier asynchronous detection pipeline that decouples privacy classification from inference and supports streaming workloads, (2) a unified radix-tree-based memory manager with path compression and sensitivity-aware eviction for scalable selective isolation, and (3) an RDR-guided (Reuse Diversity Ratio) runtime safeguard that detects and bounds residual leakage. On large LLM backends, SafeKV reduces the time-to-first-token (TTFT) overhead compared to full isolation by up to 40.58% and raises throughput by up to 2.66x. Overall, SafeKV restores the efficiency of KV reuse while enforcing strong, practical privacy for multi-tenant LLM inference.
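The selective-isolation idea in the abstract, sharing non-sensitive KV-cache blocks globally while serving sensitive blocks only back to the tenant that created them, can be sketched with a small path-compressed prefix tree over token IDs. This is an illustrative toy, not SafeKV's actual memory manager: the class and method names (`PrefixKVCache`, `reusable_prefix`) and the per-block `owner`/`sensitive` fields are assumptions, and real KV blocks would hold attention tensors rather than nothing.

```python
class _Node:
    """One path-compressed cache block: a run of token ids plus metadata."""
    __slots__ = ("edge", "children", "sensitive", "owner")

    def __init__(self, edge=()):
        self.edge = tuple(edge)      # compressed run of token ids on this edge
        self.children = {}           # first token id of child edge -> _Node
        self.sensitive = False       # set by a (possibly async) detector
        self.owner = None            # tenant id that wrote this block


class PrefixKVCache:
    """Toy radix tree over token-id prefixes with sensitivity-aware reuse."""

    def __init__(self):
        self.root = _Node()

    def insert(self, tokens, tenant, sensitive=False):
        node, i = self.root, 0
        while i < len(tokens):
            child = node.children.get(tokens[i])
            if child is None:
                new = _Node(tokens[i:])
                new.owner, new.sensitive = tenant, sensitive
                node.children[tokens[i]] = new
                return
            # Match the request against this child's compressed edge.
            j = 0
            while (j < len(child.edge) and i + j < len(tokens)
                   and child.edge[j] == tokens[i + j]):
                j += 1
            if j < len(child.edge):
                # Split the edge at the divergence point (path compression).
                split = _Node(child.edge[:j])
                split.owner, split.sensitive = child.owner, child.sensitive
                child.edge = child.edge[j:]
                split.children[child.edge[0]] = child
                node.children[tokens[i]] = split
                child = split
            node, i = child, i + j
        node.sensitive = node.sensitive or sensitive
        if node.owner is None:
            node.owner = tenant

    def reusable_prefix(self, tokens, tenant):
        """Length of cached prefix `tenant` may reuse. Sensitive blocks are
        only served to their owner, so no cross-tenant timing signal leaks."""
        node, i = self.root, 0
        while i < len(tokens):
            child = node.children.get(tokens[i])
            if child is None:
                break
            j = 0
            while (j < len(child.edge) and i + j < len(tokens)
                   and child.edge[j] == tokens[i + j]):
                j += 1
            if j < len(child.edge):
                break
            if child.sensitive and child.owner != tenant:
                break                # selective isolation: no cross-tenant hit
            node, i = child, i + j
        return i
```

A benign shared prefix (e.g. a common system prompt) stays reusable by every tenant, while a block flagged sensitive acts as cached only for its owner and as a miss for everyone else, which is exactly the reuse pattern the timing side channel depends on.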
Key Contributions
- A three-tier asynchronous detection pipeline that decouples privacy classification from inference and supports streaming workloads
- A unified radix-tree-based memory manager with path compression and sensitivity-aware eviction for scalable selective KV-cache isolation
- An RDR-guided (Reuse Diversity Ratio) runtime safeguard that detects and bounds residual leakage, reducing TTFT overhead by up to 40.58% vs. full isolation while raising throughput by up to 2.66x
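The paper names a Reuse Diversity Ratio (RDR) safeguard but this excerpt does not spell out its formula, so the sketch below assumes one plausible reading: per cache entry, RDR is the number of distinct reusing tenants divided by total reuse events. A hot entry reused many times by very few tenants (e.g. one adversary repeatedly probing to time hits) then scores a low RDR and can be flagged. The class name, thresholds, and the formula itself are all assumptions for illustration.

```python
from collections import Counter, defaultdict


class RDRMonitor:
    """Assumed RDR-style runtime monitor: flags cache entries whose reuse
    traffic is dominated by few tenants, a possible probing pattern."""

    def __init__(self, min_reuses=8, rdr_floor=0.3):
        self.hits = defaultdict(Counter)  # entry id -> {tenant: reuse count}
        self.min_reuses = min_reuses      # ignore cold entries (too few events)
        self.rdr_floor = rdr_floor        # below this ratio, flag the entry

    def record(self, entry_id, tenant):
        """Log one cache reuse of `entry_id` by `tenant`."""
        self.hits[entry_id][tenant] += 1

    def rdr(self, entry_id):
        """Distinct reusing tenants / total reuse events (1.0 if unseen)."""
        counts = self.hits[entry_id]
        total = sum(counts.values())
        return len(counts) / total if total else 1.0

    def suspicious(self, entry_id):
        """True when a well-reused entry has abnormally low reuse diversity."""
        total = sum(self.hits[entry_id].values())
        return total >= self.min_reuses and self.rdr(entry_id) < self.rdr_floor
```

A runtime using such a signal could respond by isolating the flagged entry or adding jitter to its hits, bounding how much residual timing information a probing tenant can extract.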