
$δ$-STEAL: LLM Stealing Attack with Local Differential Privacy

Kieu Dang 1, Phung Lai 1, NhatHai Phan 2, Yelong Shen 3, Ruoming Jin 4, Abdallah Khreishah 2

2 citations · 29 references · arXiv


Published on arXiv (2510.21946)

Model Theft

OWASP ML Top 10 — ML05

Output Integrity Attack

OWASP ML Top 10 — ML09

Model Theft

OWASP LLM Top 10 — LLM10

Key Finding

δ-STEAL achieves up to 96.95% attack success rate while bypassing robust LLM output watermarks across different models, with LDP noise scale controlling the utility-evasion trade-off.

δ-STEAL

Novel technique introduced


Large language models (LLMs) demonstrate remarkable capabilities across various tasks, but their deployment introduces significant intellectual-property risks. In this context, we focus on model stealing attacks, in which adversaries replicate a model's behavior to steal its service. These attacks are highly relevant to proprietary LLMs and pose serious threats to revenue and financial stability. To mitigate these risks, watermarking solutions embed imperceptible patterns in LLM outputs, enabling model traceability and intellectual-property verification. In this paper, we study the vulnerability of LLM service providers by introducing $δ$-STEAL, a novel model stealing attack that bypasses the service provider's watermark detectors while preserving the adversary's model utility. $δ$-STEAL injects noise into the token embeddings of the adversary's model during fine-tuning in a way that satisfies local differential privacy (LDP) guarantees. The adversary queries the service provider's model to collect outputs and form input-output training pairs. By applying LDP-preserving noise to these pairs, $δ$-STEAL obfuscates watermark signals, making it difficult for the service provider to determine whether its outputs were used, thereby preventing claims of model theft. Our experiments show that $δ$-STEAL with lightweight modifications achieves attack success rates of up to $96.95\%$ without significantly compromising the adversary's model utility. The noise scale in LDP controls the trade-off between attack effectiveness and model utility. This poses a significant risk: even robust watermarks can be bypassed, allowing adversaries to deceive watermark detectors and undermine current intellectual-property protection methods.
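The core mechanism, perturbing token embeddings with LDP-preserving noise before fine-tuning on the collected input-output pairs, can be sketched with the classical Laplace mechanism. This is a minimal illustration, not the paper's implementation: the function name, the choice of Laplace noise, and the sensitivity parameter are all assumptions here (the paper may use a different mechanism, and its noise scale is parameterized by δ).

```python
import numpy as np

def ldp_perturb_embeddings(embeddings, epsilon, sensitivity=1.0, rng=None):
    """Add Laplace noise to token embeddings (hypothetical sketch).

    With L1 sensitivity `sensitivity`, Laplace noise of scale
    b = sensitivity / epsilon makes each perturbed embedding satisfy
    epsilon-LDP. Smaller epsilon -> larger noise -> stronger watermark
    obfuscation, at the cost of utility in the fine-tuned clone.
    """
    rng = np.random.default_rng() if rng is None else rng
    scale = sensitivity / epsilon  # Laplace scale b = Δ/ε
    noise = rng.laplace(loc=0.0, scale=scale, size=embeddings.shape)
    return embeddings + noise

# Toy usage: 4 tokens with embedding dimension 8. In the attack setting,
# these would be embeddings of the provider's (watermarked) outputs.
emb = np.zeros((4, 8))
noisy = ldp_perturb_embeddings(emb, epsilon=1.0, rng=np.random.default_rng(0))
```

The fine-tuned clone then trains on the perturbed embeddings, so watermark-correlated signal in the token sequence is drowned out before it can be memorized.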


Key Contributions

  • δ-STEAL: a model-agnostic and watermark-agnostic LLM stealing attack that injects LDP-preserving noise into token embeddings during fine-tuning to obfuscate watermark signals in the cloned model's outputs
  • Controllable trade-off between attack effectiveness and model utility via the LDP noise scale δ, achieving up to 96.95% attack success rate with minimal utility degradation
  • Theoretical LDP guarantees showing watermarked outputs become indistinguishable from non-watermarked ones, preventing service providers from proving their outputs were used to train the stolen model
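The indistinguishability claim in the last bullet rests on the standard local differential privacy guarantee. As a reminder (textbook form, not quoted from the paper), a randomized mechanism $M$ satisfies $\varepsilon$-LDP if for all input pairs $x, x'$ and all outputs $y$:

$$\Pr[M(x) = y] \;\le\; e^{\varepsilon}\,\Pr[M(x') = y].$$

Applied here with $x$ a watermarked output and $x'$ a non-watermarked one, the bound limits how much any detector's likelihood ratio can distinguish the two, which is what prevents the provider from proving its outputs were used.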

🛡️ Threat Analysis

Model Theft

Core contribution is model stealing: adversary queries a proprietary LLM API to collect input-output pairs, then fine-tunes a local clone — directly replicating the service provider's model IP.

Output Integrity Attack

A primary novel contribution is defeating output watermarks: LDP noise injected into token embeddings obfuscates watermark signals in the stolen model's outputs, bypassing the service provider's watermark detectors and undermining content provenance verification.


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
black_box, training_time, inference_time
Applications
llm service apis, watermark-protected language model deployments