Defense · 2025

SecureInfer: Heterogeneous TEE-GPU Architecture for Privacy-Critical Tensors for Large Language Model Deployment

Tushar Nayan 1, Ziqi Zhang 2, Ruimin Sun 1

1 citation · 27 references · First International Conference...


Published on arXiv: 2510.19979

Model Theft (OWASP ML Top 10 — ML05)

Model Theft (OWASP LLM Top 10 — LLM10)

Key Finding

SecureInfer provides strong model extraction resistance on LLaMA-2 with reasonable performance overhead via TEE-GPU hybrid execution.

SecureInfer

Novel technique introduced


With the increasing deployment of Large Language Models (LLMs) on mobile and edge platforms, securing them against model extraction attacks has become a pressing concern. However, protecting model privacy without sacrificing the performance benefits of untrusted AI accelerators, such as GPUs, presents a challenging trade-off. In this paper, we study secure, high-performance execution of LLMs and present SecureInfer, a hybrid framework that leverages a heterogeneous Trusted Execution Environment (TEE)-GPU architecture to isolate privacy-critical components while offloading compute-intensive operations to untrusted accelerators. Building upon an outsourcing scheme, SecureInfer adopts an information-theoretic and threat-informed partitioning strategy: security-sensitive components, including non-linear layers, attention head projections, FFN transformations, and LoRA adapters, are executed inside an SGX enclave, while the remaining linear operations (matrix multiplications) are performed on the GPU after encryption and securely restored within the enclave. We implement a prototype of SecureInfer on the LLaMA-2 model and evaluate it across performance and security metrics. Our results show that SecureInfer offers strong security guarantees with reasonable performance, providing a practical solution for secure on-device model inference.


Key Contributions

  • Information-theoretic partitioning strategy that identifies which LLM components (non-linear layers, attention projections, LoRA adapters) are privacy-critical and must execute inside a TEE
  • Hybrid SGX-GPU architecture where matrix multiplications are encrypted and offloaded to the untrusted GPU while security-sensitive operations remain enclave-bound
  • Prototype implementation on LLaMA-2 demonstrating practical performance-security tradeoffs for on-device LLM inference
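The paper does not detail its encryption scheme in this summary, but the described flow — encrypt activations in the enclave, run the matrix multiplication on the untrusted GPU, then restore the result inside the enclave — can be sketched with generic additive masking. All names and shapes below are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shapes for a single linear layer Y = X @ W.
X = rng.standard_normal((4, 8))   # activations (privacy-critical, enclave-resident)
W = rng.standard_normal((8, 16))  # layer weights visible to the untrusted GPU

# --- Inside the enclave (trusted) ---
R = rng.standard_normal(X.shape)  # one-time additive mask
RW = R @ W                        # unmasking term, precomputable offline
X_blind = X + R                   # only the blinded activations leave the enclave

# --- On the untrusted GPU ---
Y_blind = X_blind @ W             # GPU computes on X + R, never sees X

# --- Back inside the enclave ---
Y = Y_blind - RW                  # strip the mask to recover X @ W

assert np.allclose(Y, X @ W)
```

The design point this illustrates: linearity is what makes GPU offload cheap to protect, since the mask's contribution `R @ W` factors out exactly; non-linear operations (softmax, activations) have no such cancellation, which is why they stay enclave-bound.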

🛡️ Threat Analysis

Model Theft

SecureInfer is explicitly designed to prevent model extraction attacks — adversaries attempting to steal LLM model weights/parameters. The TEE-based partitioning strategy keeps privacy-critical components (attention projections, FFN transformations, LoRA adapters) inside a trusted enclave to prevent unauthorized reconstruction of the model's intellectual property.
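One illustrative reading of this partitioning is a per-operator placement table for a transformer block. The operator names and the default-to-GPU policy below are assumptions for the sketch; the paper's exact operator-level assignment is not specified in this summary:

```python
# Hypothetical enclave/GPU placement following the split described above:
# non-linear ops, attention projections, FFN transformations, and LoRA
# adapters stay in the enclave; bulk maskable matmuls go to the GPU.
PARTITION = {
    "attention.qkv_projection": "enclave",  # attention head projections
    "attention.scores_matmul":  "gpu",      # large linear matmul, maskable
    "attention.softmax":        "enclave",  # non-linear
    "ffn.transform":            "enclave",  # FFN transformations
    "ffn.activation":           "enclave",  # non-linear (e.g. SiLU)
    "lora.adapters":            "enclave",  # fine-tuned intellectual property
    "layernorm":                "enclave",  # non-linear normalization
}

def placement(op_name: str) -> str:
    """Return where a given operation runs; unlisted linear ops default to GPU."""
    return PARTITION.get(op_name, "gpu")
```

Keeping the table explicit makes the threat model auditable: anything an extraction adversary could use to reconstruct the model's distinguishing parameters must map to "enclave".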


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
black_box, inference_time
Datasets
LLaMA-2
Applications
llm inference, on-device ai, edge deployment