
Confidential LLM Inference: Performance and Cost Across CPU and GPU TEEs

Marcin Chrapek, Marcin Copik, Etienne Mettaz, Torsten Hoefler

4 citations · 1 influential · 76 references · IEEE International Symposium o...


Published on arXiv: 2509.18886

Model Theft

OWASP ML Top 10 — ML05

Sensitive Information Disclosure

OWASP LLM Top 10 — LLM06

Key Finding

CPU TEEs (TDX/SGX with AMX) impose under 10% throughput and 20% latency overhead for Llama2 inference, while NVIDIA H100 Confidential Compute GPU TEEs impose only 4–8% throughput penalties that shrink with larger workloads.

Confidential LLM (cLLM) inference via TEEs

Novel technique introduced


Large Language Models (LLMs) are increasingly deployed on converged Cloud and High-Performance Computing (HPC) infrastructure. However, as LLMs handle confidential inputs and are fine-tuned on costly, proprietary datasets, their heightened security requirements slow adoption in privacy-sensitive sectors such as healthcare and finance. We investigate methods to address this gap and propose Trusted Execution Environments (TEEs) as a solution for securing end-to-end LLM inference. We validate their practicality by evaluating these compute-intensive workloads entirely within CPU and GPU TEEs. On the CPU side, we conduct an in-depth study running full Llama2 inference pipelines (7B, 13B, 70B) inside Intel's TDX and SGX, accelerated by Advanced Matrix Extensions (AMX). We derive 12 insights, including that across various data types, batch sizes, and input lengths, CPU TEEs impose under 10% throughput and 20% latency overheads, further reduced by AMX. We run LLM inference on NVIDIA H100 Confidential Compute GPUs, contextualizing our CPU findings and observing throughput penalties of 4-8% that diminish as batch and input sizes grow. By comparing performance, cost, and security trade-offs, we show how CPU TEEs can be more cost-effective or secure than their GPU counterparts. To our knowledge, our work is the first to comprehensively demonstrate the performance and practicality of modern TEEs across both CPUs and GPUs for enabling confidential LLMs (cLLMs).
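The headline numbers above come from paired benchmark runs: the same inference workload timed natively and inside a TEE, with overheads reported as relative differences. A minimal sketch of that arithmetic, using our own helper names and illustrative numbers (not the paper's code or measurements):

```python
# Hypothetical helpers for turning paired native/TEE timings into the
# throughput and latency overheads reported in the paper.

def throughput_tok_s(total_tokens: int, elapsed_s: float) -> float:
    """Generated tokens divided by wall-clock time."""
    return total_tokens / elapsed_s

def throughput_overhead_pct(native: float, tee: float) -> float:
    """Throughput drop of the TEE run relative to native, in percent."""
    return (native - tee) / native * 100.0

def latency_overhead_pct(native_s: float, tee_s: float) -> float:
    """Latency increase of the TEE run relative to native, in percent."""
    return (tee_s - native_s) / native_s * 100.0

if __name__ == "__main__":
    # Illustrative numbers only, not measurements from the paper:
    native = throughput_tok_s(4096, 10.0)   # 409.6 tok/s
    in_tee = throughput_tok_s(4096, 10.7)   # ~382.8 tok/s
    print(f"throughput overhead: {throughput_overhead_pct(native, in_tee):.1f}%")
```

Note that throughput overhead is measured against the higher-is-better native rate, while latency overhead is measured against the lower-is-better native time, so the two percentages are not interchangeable.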


Key Contributions

  • First comprehensive performance evaluation of full LLM inference (Llama2 7B/13B/70B) inside both CPU TEEs (Intel TDX and SGX with AMX acceleration) and GPU TEEs (NVIDIA H100 Confidential Compute)
  • 12 derived performance insights showing CPU TEEs impose under 10% throughput and 20% latency overhead, with GPU TEEs showing 4–8% throughput penalties that diminish at larger batch/input sizes
  • Cost-security trade-off analysis showing CPU TEEs can be more cost-effective or offer stronger security guarantees than GPU TEEs for certain deployment scenarios

🛡️ Threat Analysis

Model Theft

TEEs protect proprietary LLM model weights from being stolen by cloud service providers, cluster administrators, or co-tenants — directly addressing model IP theft during cloud-based inference deployment.
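This protection rests on remote attestation: the client releases weights or prompts only after the enclave proves it is running approved code. A hedged sketch of that trust check with stand-in values (real deployments verify cryptographically signed quotes from Intel TDX or NVIDIA Confidential Computing attestation services, not a bare hash):

```python
import hashlib
import hmac

# Known-good measurement of the approved serving image (stand-in value;
# in practice this comes from the TEE vendor's signed attestation quote).
EXPECTED_MEASUREMENT = hashlib.sha256(b"llama2-serving-image-v1").hexdigest()

def verify_attestation(reported_measurement: str) -> bool:
    """Release model weights/prompts only to an enclave whose attested
    code measurement matches the expected one. Constant-time comparison
    avoids leaking match position via timing."""
    return hmac.compare_digest(reported_measurement, EXPECTED_MEASUREMENT)
```

If the check fails, the client withholds the secrets entirely, which is what keeps the weights out of reach of the cloud provider, administrators, and co-tenants.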


Details

Domains
nlp
Model Types
llm
Threat Tags
inference_time
Datasets
Llama2 7B, Llama2 13B, Llama2 70B
Applications
llm inference, healthcare ai, financial ai, cloud-deployed llms