Provably Protecting Fine-Tuned LLMs from Training Data Extraction
Tom Segal, Asaf Shabtai, Yuval Elovici
Published on arXiv
arXiv:2602.00688
Model Inversion Attack
OWASP ML Top 10 — ML03
Sensitive Information Disclosure
OWASP LLM Top 10 — LLM06
Key Finding
SCP-Δr achieves orders-of-magnitude better provable TDE protection bounds than existing NAF-based methods while incurring no observable utility degradation on fine-tuned LLMs.
SCP-Δr
Novel technique introduced
Fine-tuning large language models (LLMs) on sensitive datasets raises privacy concerns, as training data extraction (TDE) attacks can expose highly confidential information. Existing defenses against such attacks either lack formal privacy guarantees or incur substantial utility degradation. We observe that although fine-tuning induces widespread probability shifts, preserving only a small subset of influential token-level deviations is sufficient; the remaining shifts can be aggressively smoothed with minimal impact on utility. Motivated by this insight, we propose SCP-Δr, a Near Access Freeness (NAF)-based algorithm that operates on relative probabilities and explicitly smooths low-impact tokens using a base model. SCP-Δr achieves orders-of-magnitude better theoretical bounds than existing NAF-based methods and provides strong empirical protection against TDE attacks with minimal performance loss.
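The core mechanism described above — keep only influential token-level shifts, smooth the rest back to the base model — can be sketched in a few lines. This is an illustrative reconstruction, not the paper's actual algorithm: the threshold `tau` and the per-token log-probability comparison are assumptions for demonstration.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - z for x in logits]

def smoothed_logprobs(ft_logits, base_logits, tau=0.5):
    """Hypothetical sketch of selective token smoothing (not the paper's
    exact SCP-Δr procedure): keep the fine-tuned log-probability only where
    its shift relative to the base model exceeds tau; elsewhere fall back
    to the base model, then renormalize into a proper distribution."""
    ft = log_softmax(ft_logits)
    base = log_softmax(base_logits)
    mixed = [f if abs(f - b) > tau else b for f, b in zip(ft, base)]
    # renormalize the mixed scores so they again sum to probability 1
    z = math.log(sum(math.exp(x) for x in mixed))
    return [x - z for x in mixed]
```

Under this sketch, tokens whose distribution barely moved during fine-tuning reveal nothing beyond the base model, which is the intuition behind the tighter NAF-style bound.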
Key Contributions
- Introduces SpaRPS (Sparse Relative Probability Shift), a formal property capturing the sparsity of fine-tuning-induced relative probability shifts and motivating selective token smoothing
- Proposes SCP-Δr, a NAF-based defense using relative probability distributions and base-model smoothing that achieves orders-of-magnitude tighter theoretical TDE bounds than CP-Δ
- Demonstrates the first practical extraction attacks that break existing NAF-protected LLMs, then shows SCP-Δr resists them with no observable utility degradation
🛡️ Threat Analysis
Training data extraction (TDE) attacks are the core threat model: an adversary with black-box access to the fine-tuned LLM reconstructs verbatim training sequences, a direct form of model inversion / data reconstruction. SCP-Δr is proposed as a defense with formal bounds on the adversary's information gain.
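A minimal probe for this threat model can be sketched as follows. Everything here is a toy illustration: the `generate` stand-in, the memorized record, and the verbatim-match criterion are assumptions, not the attacks evaluated in the paper.

```python
def tde_probe(generate, prefix, true_suffix):
    """Hypothetical black-box TDE probe: prompt the model with the prefix of
    a suspected training record and check whether the greedy completion
    reproduces the confidential suffix verbatim."""
    completion = generate(prefix)
    return completion.startswith(true_suffix)

# Toy stand-in for a fine-tuned model that memorized one record
# (illustration only; a real probe would call the deployed model's API).
memorized = {"Patient SSN: ": "123-45-6789, diagnosis follows"}
generate = lambda prompt: memorized.get(prompt, "")
```

A defense with a formal bound on adversary information gain aims to make such probes succeed no more often than they would against the base model alone.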