Defense · 2025

Adaptive Token-Weighted Differential Privacy for LLMs: Not All Tokens Require Equal Protection

Manjiang Yu 1, Priyanka Singh 1, Xue Li 1, Yang Cao 2

1 citation · 25 references · arXiv

Published on arXiv · 2509.23246

Model Inversion Attack

OWASP ML Top 10 — ML03

Sensitive Information Disclosure

OWASP LLM Top 10 — LLM06

Key Finding

ATDP shortens DP fine-tuning time by approximately 90% while achieving comparable or superior canary protection and minimal accuracy degradation relative to state-of-the-art DP-SGD baselines.

ATDP (Adaptive Token-Weighted Differential Privacy)

Novel technique introduced


Large language models (LLMs) frequently memorize sensitive or personal information, raising significant privacy concerns. Existing variants of differentially private stochastic gradient descent (DP-SGD) inject uniform noise into every gradient step, significantly extending training time and reducing model accuracy. We propose that concentrating noise primarily on gradients associated with sensitive tokens can substantially decrease DP training time, strengthen the protection of sensitive information, and simultaneously preserve the model's performance on non-sensitive data. We operationalize this insight through Adaptive Token-Weighted Differential Privacy (ATDP), a modification of vanilla DP-SGD that adaptively assigns different gradient weights to sensitive and non-sensitive tokens. By employing a larger noise scale at the early stage of training, ATDP rapidly disrupts memorization of sensitive content. As a result, ATDP requires only a few additional epochs of lightweight post-processing after standard fine-tuning, injecting targeted noise primarily on parameters corresponding to sensitive tokens and thus minimally affecting the model's general capabilities. ATDP can be seamlessly integrated into any existing DP-based fine-tuning pipeline or applied directly to non-private models as a fast privacy-enhancing measure. Combined with an initial redacted fine-tuning phase, ATDP forms a streamlined DP pipeline that achieves canary protection comparable to state-of-the-art DP-SGD methods while significantly reducing the computational overhead of DP fine-tuning, shortening training time by approximately 90 percent with comparable or superior privacy protection and minimal accuracy degradation.


Key Contributions

  • ATDP: a DP-SGD variant that adaptively assigns larger gradient noise to sensitive tokens and standard noise to non-sensitive tokens, reducing unnecessary performance degradation.
  • Two-phase pipeline (redacted fine-tuning + ATDP post-processing) that reduces total DP training time by ~90% compared to full DP-SGD while achieving comparable canary protection.
  • Plug-in design that integrates with any existing DP fine-tuning pipeline or applies directly to non-private models as a lightweight privacy-enhancing post-processing step.
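The core mechanism described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function name, the binary sensitive-token mask, and the per-token clipping scheme are assumptions for the sketch; the paper's actual weighting and accounting may differ.

```python
import numpy as np

def atdp_noisy_update(per_token_grads, sensitive_mask,
                      sigma_sensitive, sigma_default,
                      clip_norm=1.0, rng=None):
    """Sketch of adaptive token-weighted DP noise (hypothetical helper):
    clip each per-token gradient to `clip_norm`, then add Gaussian noise
    whose scale depends on whether the token is flagged sensitive, and
    average the result into a single update."""
    rng = rng or np.random.default_rng(0)
    noisy = []
    for g, is_sensitive in zip(per_token_grads, sensitive_mask):
        norm = np.linalg.norm(g)
        g = g * min(1.0, clip_norm / (norm + 1e-12))  # per-token clipping
        sigma = sigma_sensitive if is_sensitive else sigma_default
        noisy.append(g + rng.normal(0.0, sigma * clip_norm, size=g.shape))
    return np.mean(noisy, axis=0)
```

With `sigma_default` set near zero, non-sensitive tokens train almost as in standard fine-tuning, while a large `sigma_sensitive` disrupts memorization of flagged tokens, which is the intuition behind ATDP's reduced accuracy cost relative to uniform-noise DP-SGD.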

🛡️ Threat Analysis

Model Inversion Attack

The paper directly defends against adversarial extraction of memorized sensitive training data from LLMs, evaluated via canary exposure — the threat model is an adversary reconstructing specific private tokens (PII, passwords, medical data) from model outputs, matching ML03's data-reconstruction attack definition.
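Canary exposure, as used in this kind of evaluation, is typically the "secret sharer" metric: insert an artificial canary into the training data, then rank its perplexity against random candidate strings. The sketch below assumes that standard formulation; the paper's exact evaluation protocol may differ.

```python
import math

def canary_exposure(canary_ppl, candidate_ppls):
    """Secret-sharer exposure sketch: rank the canary's perplexity
    among random candidates (rank 1 = most strongly memorized);
    exposure = log2(#candidates) - log2(rank)."""
    rank = 1 + sum(1 for p in candidate_ppls if p < canary_ppl)
    return math.log2(len(candidate_ppls)) - math.log2(rank)
```

A high exposure means the model assigns the inserted canary unusually low perplexity relative to the candidates, i.e. it has memorized the secret; an effective defense such as ATDP should drive exposure toward zero.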


Details

Domains: nlp
Model Types: llm, transformer
Threat Tags: training_time, white_box
Applications: llm fine-tuning, privacy-preserving nlp, sensitive data protection in language models