Towards Confidential and Efficient LLM Inference with Dual Privacy Protection
Honglan Yu 1,2,3, Yibin Wang 1,2,3, Feifei Dai 1, Dong Liu 1, Haihui Fan 1, Xiaoyan Gu 1,2,3
Published on arXiv (2509.09091)
Sensitive Information Disclosure
OWASP LLM Top 10 — LLM06
Key Finding
CMIF reduces TEE-induced inference overhead from 6.5× to 1.54× while preserving user input privacy against a curious inference server.
CMIF (Confidential and efficient Model Inference Framework)
Novel technique introduced
CPU-based trusted execution environments (TEEs) and differential privacy (DP) are widely used for private inference. Because inference inside TEEs is slow, researchers adopt partition-based approaches that offload linear model components to GPUs. However, the dense nonlinear layers of large language models (LLMs) cause significant communication overhead between TEEs and GPUs. DP-based approaches add random noise to protect data privacy, but the noise degrades LLM performance and semantic understanding. To overcome these drawbacks, this paper proposes CMIF, a Confidential and efficient Model Inference Framework. CMIF deploys the embedding layer confidentially in a client-side TEE and the subsequent layers on GPU servers. Meanwhile, it optimizes the Report-Noisy-Max mechanism to protect sensitive inputs with only a slight decrease in model performance. Extensive experiments on Llama-series models demonstrate that CMIF reduces the additional inference overhead of TEEs while preserving user data privacy.
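The partition the abstract describes can be sketched as follows. This is a minimal illustration, not the paper's implementation: the dimensions, the placeholder server layer, and the function names (`client_embed`, `server_forward`) are all assumptions; a real deployment would run the client step inside an actual TEE enclave.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions; the paper targets Llama-series models.
VOCAB, D_MODEL = 100, 16
embedding_table = rng.normal(size=(VOCAB, D_MODEL))  # held client-side, inside the TEE

def client_embed(token_ids):
    """Client-side TEE step: map raw token ids to embeddings, so the
    plaintext tokens never leave the trusted boundary."""
    return embedding_table[token_ids]

def server_forward(hidden):
    """Untrusted GPU server: runs the remaining (linear and nonlinear)
    layers on embeddings only. A single stand-in layer is used here."""
    w = np.ones((D_MODEL, D_MODEL)) / D_MODEL  # placeholder weights
    return np.maximum(hidden @ w, 0.0)

tokens = np.array([3, 17, 42])
out = server_forward(client_embed(tokens))  # only embeddings cross the boundary
print(out.shape)
```

Note the design point this illustrates: because only the embedding layer lives in the TEE, the client sends a single tensor to the server per query, avoiding the repeated TEE-to-GPU round trips that partition-based approaches incur at every nonlinear layer.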
Key Contributions
- CMIF framework that deploys the embedding layer + DP sanitization inside a client-side TEE while offloading remaining LLM computation to GPU servers, reducing TEE inference overhead from 6.5× to 1.54× over baseline.
- Optimized Report-Noisy-Max (RNM) text sanitization mechanism that replaces sensitive tokens with semantically coherent alternatives under differential privacy, outperforming prior DP-based methods at comparable privacy budgets.
- Dual-layer privacy protection combining TEE hardware isolation (for model/sanitizer confidentiality) with DP input obfuscation (for user query privacy), evaluated on Llama-series models.
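The Report-Noisy-Max selection underlying the sanitization step can be sketched as below. This is a generic RNM sketch under stated assumptions, not CMIF's optimized variant: the toy embeddings, cosine-similarity utility, and the `sanitize` helper are all hypothetical; the Laplace scale of 2·sensitivity/ε is the standard choice that makes the selection ε-differentially private.

```python
import numpy as np

rng = np.random.default_rng(1)

def report_noisy_max(scores, epsilon, sensitivity=1.0):
    """Report-Noisy-Max: add Laplace(2*sensitivity/epsilon) noise to each
    candidate's utility score and release only the argmax index."""
    noise = rng.laplace(scale=2.0 * sensitivity / epsilon, size=len(scores))
    return int(np.argmax(np.asarray(scores) + noise))

# Hypothetical setup: replace a sensitive token with a semantically
# close candidate, scored by cosine similarity of toy 2-d embeddings.
emb = {"London": np.array([1.0, 0.2]),
       "Paris":  np.array([0.9, 0.3]),
       "Berlin": np.array([0.8, 0.1]),
       "banana": np.array([0.0, 1.0])}

def sanitize(token, candidates, epsilon):
    """Pick a DP replacement for `token` from `candidates` via RNM."""
    v = emb[token]
    scores = [float(v @ emb[c] / (np.linalg.norm(v) * np.linalg.norm(emb[c])))
              for c in candidates]
    return candidates[report_noisy_max(scores, epsilon)]

cands = ["Paris", "Berlin", "banana"]
replacement = sanitize("London", cands, epsilon=8.0)
print(replacement)
```

Because the utility is semantic similarity, high-scoring candidates are coherent substitutes, which is how this style of sanitization avoids the output-quality loss of adding raw noise to embeddings; a larger ε concentrates the choice on the most similar candidate, a smaller ε spreads it out.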