
Provably Protecting Fine-Tuned LLMs from Training Data Extraction

Tom Segal, Asaf Shabtai, Yuval Elovici

0 citations · 34 references · arXiv


Published on arXiv (2602.00688)

Model Inversion Attack (OWASP ML Top 10: ML03)

Sensitive Information Disclosure (OWASP LLM Top 10: LLM06)

Key Finding

SCP-Δr achieves orders-of-magnitude better provable TDE protection bounds than existing NAF-based methods while incurring no observable utility degradation on fine-tuned LLMs.

SCP-Δr (novel technique introduced)


Fine-tuning large language models (LLMs) on sensitive datasets raises privacy concerns, as training data extraction (TDE) attacks can expose highly confidential information. Existing defenses against such attacks either lack formal privacy guarantees or incur substantial utility degradation. We observe that fine-tuning induces widespread probability shifts, yet preserving only a small subset of influential token-level deviations is sufficient; the remaining shifts can be aggressively smoothed with minimal impact on utility. Motivated by this insight, we propose SCP-Δr, a Near Access Freeness (NAF)-based algorithm that operates on relative probabilities and explicitly smooths low-impact tokens using a base model. SCP-Δr achieves orders-of-magnitude better theoretical bounds than existing NAF-based methods and provides strong empirical protection against TDE attacks with minimal performance loss.
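The core idea — keep only the most influential token-level deviations and snap the rest back to the base model — can be illustrated with a toy sketch. This is not the paper's actual SCP-Δr algorithm; the function name, the top-k selection rule, and the toy distributions are all illustrative assumptions based only on the abstract's description.

```python
import math

def smooth_distribution(p_ft, p_base, keep_k=1):
    """Illustrative selective smoothing (NOT the paper's SCP-Δr):
    keep the keep_k tokens whose log-probability shifted most under
    fine-tuning, snap every other token back to the base model's
    probability, then renormalize."""
    # Relative (log-ratio) shift per token between fine-tuned and base.
    shifts = {t: abs(math.log(p_ft[t] / p_base[t])) for t in p_ft}
    keep = set(sorted(shifts, key=shifts.get, reverse=True)[:keep_k])
    mixed = {t: (p_ft[t] if t in keep else p_base[t]) for t in p_ft}
    z = sum(mixed.values())
    return {t: p / z for t, p in mixed.items()}

# Toy next-token distributions: fine-tuning sharply boosted "alice".
p_base = {"alice": 0.05, "the": 0.50, "a": 0.30, "dog": 0.15}
p_ft   = {"alice": 0.40, "the": 0.35, "a": 0.15, "dog": 0.10}
smoothed = smooth_distribution(p_ft, p_base, keep_k=1)
```

With `keep_k=1`, only the largest shift ("alice") survives; the other tokens return to base-model proportions, which is the sparsity intuition behind SpaRPS.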


Key Contributions

  • Introduces SpaRPS (Sparse Relative Probability Shift), a formal property capturing the sparsity of fine-tuning-induced relative probability shifts and motivating selective token smoothing
  • Proposes SCP-Δr, a NAF-based defense using relative probability distributions and base-model smoothing that achieves orders-of-magnitude tighter theoretical TDE bounds than CP-Δ
  • Demonstrates the first practical extraction attacks that break existing NAF-protected LLMs, then shows SCP-Δr resists them with no observable utility degradation

🛡️ Threat Analysis

Model Inversion Attack

Training data extraction (TDE) attacks are the core threat model — an adversary reconstructs verbatim training sequences from black-box LLM access, which is a direct model inversion / data reconstruction attack. SCP-Δr is proposed as a defense with formal bounds on adversary information gain.
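A black-box TDE probe of this kind can be sketched as a prefix-completion test: prompt with a prefix known to appear in training data and measure how much of the remaining sequence comes back verbatim. The helper name, the toy "models", and the memorized string below are all hypothetical, not the attacks evaluated in the paper.

```python
def extraction_probe(generate, prefix, secret_suffix):
    """Black-box TDE probe: prompt with a training-data prefix and
    measure the fraction of the secret suffix reproduced verbatim."""
    completion = generate(prefix).split()
    target = secret_suffix.split()
    matched = 0
    for got, want in zip(completion, target):
        if got != want:
            break
        matched += 1
    return matched / len(target)

# Hypothetical over-fit model that parrots a memorized training record.
memorized = {"Account 4417 belongs to": "J. Smith with balance 92,310 USD"}
leaky = lambda prefix: memorized.get(prefix, "")
safe = lambda prefix: "no information available"  # hypothetical protected model

leak_rate = extraction_probe(leaky, "Account 4417 belongs to",
                             "J. Smith with balance 92,310 USD")  # 1.0
safe_rate = extraction_probe(safe, "Account 4417 belongs to",
                             "J. Smith with balance 92,310 USD")  # 0.0
```

A defense with formal NAF-style bounds aims to make the leaky case provably close to the safe case for any such probe.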


Details

Domains: nlp
Model Types: llm, transformer
Threat Tags: black_box, inference_time
Applications: llm fine-tuning, sensitive data protection, training data privacy