Practical Secure Inference Algorithm for Fine-tuned Large Language Model Based on Fully Homomorphic Encryption
Zhang Ruoyan, Zheng Zhongxiang, Bao Wankang
Published on arXiv
2501.01672
Model Theft
OWASP ML Top 10 — ML05
Model Theft
OWASP LLM Top 10 — LLM10
Key Finding
The proposed scheme achieves 1.61 s/token inference on ChatGLM2-6B fine-tuned with LoRA while provably protecting the server's LoRA weights against model extraction attacks, under the hardness of Learning with Errors (LWE).
Private Linear Layer (PLL)
Novel technique introduced
Large language models (LLMs) are currently at the forefront of machine learning; they show broad application prospects but at the same time expose risks of privacy leakage. We combine Fully Homomorphic Encryption (FHE) and provable-security theory with Parameter-Efficient Fine-Tuning (PEFT) to propose an efficient and secure inference scheme for LLMs. More specifically, we focus on pre-trained LLMs that build on an open-source base model and are then fine-tuned on private datasets with LoRA, a popular roadmap for vertical-domain models such as LawGPT and BenTsao. We use two key techniques. First, we divide the whole model into a public part and a private part: the weights of the public part are publicly accessible (e.g., the open-source base model), while the private part must be protected (e.g., the LoRA matrices). In this way, the overhead of computing on private data is greatly reduced. Second, we propose a general method to transform a linear layer into another one that resists model extraction attacks while preserving the original functionality, which we denote the Private Linear Layer (PLL). We apply this method to the LoRA matrices so that the server protects its private weights without restricting the user's input, and we show that the difficulty of performing a model extraction attack against a PLL reduces to the well-known hard problem Learning with Errors (LWE). Combining this method with FHE, we protect the user's input at the same time. We use the open-source model ChatGLM2-6B as the base model, fine-tuned with LoRA. Experimental results show that the inference efficiency of our scheme reaches 1.61 s/token, demonstrating good practicality.
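To make the public/private split concrete, here is a minimal numerical sketch (the dimensions, variable names, and NumPy setup are illustrative assumptions, not the paper's code). It shows that a LoRA-fine-tuned layer h = (W0 + BA)x decomposes exactly into a public term the client can evaluate in plaintext and a low-rank private term the server evaluates; in the actual scheme the server-side product runs over an FHE ciphertext through the PLL.

```python
import numpy as np

rng = np.random.default_rng(1)
d, r = 8, 2                          # toy hidden size and LoRA rank

W0 = rng.standard_normal((d, d))     # public base weights, held by the client
A = rng.standard_normal((r, d))      # private LoRA down-projection (server)
B = rng.standard_normal((d, r))      # private LoRA up-projection (server)
x = rng.standard_normal(d)           # a user activation

h_public = W0 @ x                    # client-side plaintext compute
delta_private = B @ (A @ x)          # server-side; under FHE in the real scheme

h = h_public + delta_private
# The split reproduces the fused fine-tuned layer exactly.
assert np.allclose(h, (W0 + B @ A) @ x)
print("split matches fused layer:", np.allclose(h, (W0 + B @ A) @ x))
```

Because the private term has rank r much smaller than d, the encrypted computation touches only the small LoRA factors, which is where the scheme's claimed reduction in FHE overhead comes from.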
Key Contributions
- Private Linear Layer (PLL): a transformation of LoRA matrices that enables correct inference while making model extraction computationally hard, with security provably reducible to the Learning with Errors (LWE) problem.
- Hybrid public/private model split: the open-source base model runs on the client side in plaintext while the private LoRA adapter runs on the server, minimizing FHE overhead and achieving 1.61 s/token inference (sketched numerically above).
- Integration of FHE with PLL to protect the user's input and the server's fine-tuned weights simultaneously during inference (a minimal encrypted-LoRA sketch follows this list).
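As a rough illustration of the FHE half of this integration, the sketch below evaluates the LoRA correction on an encrypted activation using the open-source TenSEAL library with the CKKS scheme. The library choice and all parameters are our assumptions rather than the paper's stated backend, and the PLL transformation of the factors is omitted for brevity.

```python
import numpy as np
import tenseal as ts

rng = np.random.default_rng(2)
d, r = 8, 2                                   # toy hidden size and LoRA rank
A = rng.standard_normal((r, d)) * 0.1         # private LoRA factors (server)
B = rng.standard_normal((d, r)) * 0.1
x = rng.standard_normal(d)                    # user activation (client)

# CKKS context; in a real deployment the secret key stays with the client.
ctx = ts.context(ts.SCHEME_TYPE.CKKS,
                 poly_modulus_degree=8192,
                 coeff_mod_bit_sizes=[60, 40, 40, 60])
ctx.global_scale = 2 ** 40
ctx.generate_galois_keys()

enc_x = ts.ckks_vector(ctx, x.tolist())       # client encrypts its activation

# Server side: two plaintext-matrix multiplications on the ciphertext,
# computing B @ (A @ x) without ever decrypting x.
enc_delta = enc_x.matmul(A.T.tolist()).matmul(B.T.tolist())

delta = np.array(enc_delta.decrypt())         # client decrypts the correction
print("close to plaintext LoRA delta:",
      np.allclose(delta, B @ (A @ x), atol=1e-3))
```

The server learns nothing about x beyond the ciphertext, and the client only ever receives the low-rank correction, never the LoRA factors themselves; the PLL transformation is what keeps those factors safe even against chosen-input extraction queries.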
🛡️ Threat Analysis
The core technical contribution — the Private Linear Layer (PLL) — transforms the LoRA matrices so that model extraction attacks become computationally infeasible, with security formally reduced to the LWE hard problem. This is a direct defense against theft of the server's private fine-tuned parameters.
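To see why extraction reduces to LWE, consider a toy protected layer whose every response is a noisy inner product mod q (the modulus, noise width, and response form below are illustrative assumptions, not the paper's exact construction). An attacker's chosen-input transcript is then precisely a list of LWE samples (x, ⟨w, x⟩ + e mod q) with the private weight row w as the secret, so recovering w is as hard as solving LWE.

```python
import numpy as np

q = 2 ** 16                               # assumed working modulus
d = 64                                    # input dimension
sigma = 3.0                               # assumed Gaussian noise width

rng = np.random.default_rng(0)
w = rng.integers(0, q, size=d, dtype=np.int64)   # one private weight row

def query(x: np.ndarray) -> int:
    """One chosen-input query: a noisy inner product modulo q."""
    e = int(round(rng.normal(0.0, sigma)))       # fresh small noise per query
    return int((int(w @ x) + e) % q)

# Each (x, query(x)) pair is one LWE sample; many queries hand the
# attacker an LWE instance, not the weights themselves.
x = rng.integers(0, q, size=d, dtype=np.int64)
print("LWE sample (x[:4], b):", x[:4], query(x))
print("exact inner product mod q:", int(w @ x) % q)
```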