
Provably Protecting Fine-Tuned LLMs from Training Data Extraction

Tom Segal, Asaf Shabtai, Yuval Elovici

0 citations · 34 references · arXiv


Published on arXiv (2602.00688)

Model Inversion Attack (OWASP ML Top 10: ML03)

Sensitive Information Disclosure (OWASP LLM Top 10: LLM06)

Key Finding

SCP-Δr achieves orders-of-magnitude better provable TDE protection bounds than existing NAF-based methods while incurring no observable utility degradation on fine-tuned LLMs.

SCP-Δr (novel technique introduced)


Fine-tuning large language models (LLMs) on sensitive datasets raises privacy concerns, as training data extraction (TDE) attacks can expose highly confidential information. Existing defenses against such attacks either lack formal privacy guarantees or incur substantial utility degradation. We observe that fine-tuning induces widespread probability shifts, yet preserving only a small subset of influential token-level deviations is sufficient; the remaining shifts can be aggressively smoothed with minimal impact on utility. Motivated by this insight, we propose SCP-Δr, a Near Access Freeness (NAF)-based algorithm that operates on relative probabilities and explicitly smooths low-impact tokens using a base model. SCP-Δr achieves orders-of-magnitude better theoretical bounds than existing NAF-based methods and provides strong empirical protection against TDE attacks with minimal performance loss.
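The core idea — keep only the most influential token-level deviations and snap the rest back to the base model — can be illustrated with a toy sketch. This is not the paper's actual SCP-Δr algorithm; the function name, the top-k selection rule, and the toy distributions are all illustrative assumptions based only on the abstract's description.

```python
import math

def smooth_distribution(p_ft, p_base, keep_k=1):
    """Illustrative selective smoothing (NOT the paper's SCP-Δr):
    keep the keep_k tokens whose log-probability shifted most under
    fine-tuning, snap every other token back to the base model's
    probability, then renormalize."""
    # Relative (log-ratio) shift per token between fine-tuned and base.
    shifts = {t: abs(math.log(p_ft[t] / p_base[t])) for t in p_ft}
    keep = set(sorted(shifts, key=shifts.get, reverse=True)[:keep_k])
    mixed = {t: (p_ft[t] if t in keep else p_base[t]) for t in p_ft}
    z = sum(mixed.values())
    return {t: p / z for t, p in mixed.items()}

# Toy next-token distributions: fine-tuning sharply boosted "alice".
p_base = {"alice": 0.05, "the": 0.50, "a": 0.30, "dog": 0.15}
p_ft   = {"alice": 0.40, "the": 0.35, "a": 0.15, "dog": 0.10}
smoothed = smooth_distribution(p_ft, p_base, keep_k=1)
```

With `keep_k=1`, only the largest shift ("alice") survives; the other tokens return to base-model proportions, which is the sparsity intuition behind SpaRPS.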


Key Contributions

  • Introduces SpaRPS (Sparse Relative Probability Shift), a formal property capturing the sparsity of fine-tuning-induced relative probability shifts and motivating selective token smoothing
  • Proposes SCP-Δr, a NAF-based defense using relative probability distributions and base-model smoothing that achieves orders-of-magnitude tighter theoretical TDE bounds than CP-Δ
  • Demonstrates the first practical extraction attacks that break existing NAF-protected LLMs, then shows SCP-Δr resists them with no observable utility degradation

🛡️ Threat Analysis

Model Inversion Attack

Training data extraction (TDE) attacks are the core threat model — an adversary reconstructs verbatim training sequences from black-box LLM access, which is a direct model inversion / data reconstruction attack. SCP-Δr is proposed as a defense with formal bounds on adversary information gain.
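A black-box TDE probe of this kind can be sketched as a prefix-completion test: prompt with a prefix known to appear in training data and measure how much of the remaining sequence comes back verbatim. The helper name, the toy "models", and the memorized string below are all hypothetical, not the attacks evaluated in the paper.

```python
def extraction_probe(generate, prefix, secret_suffix):
    """Black-box TDE probe: prompt with a training-data prefix and
    measure the fraction of the secret suffix reproduced verbatim."""
    completion = generate(prefix).split()
    target = secret_suffix.split()
    matched = 0
    for got, want in zip(completion, target):
        if got != want:
            break
        matched += 1
    return matched / len(target)

# Hypothetical over-fit model that parrots a memorized training record.
memorized = {"Account 4417 belongs to": "J. Smith with balance 92,310 USD"}
leaky = lambda prefix: memorized.get(prefix, "")
safe = lambda prefix: "no information available"  # hypothetical protected model

leak_rate = extraction_probe(leaky, "Account 4417 belongs to",
                             "J. Smith with balance 92,310 USD")  # 1.0
safe_rate = extraction_probe(safe, "Account 4417 belongs to",
                             "J. Smith with balance 92,310 USD")  # 0.0
```

A defense with formal NAF-style bounds aims to make the leaky case provably close to the safe case for any such probe.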


Details

Domains: nlp
Model Types: llm, transformer
Threat Tags: black_box, inference_time
Applications: llm fine-tuning, sensitive data protection, training data privacy