defense 2025

AlignDP: Hybrid Differential Privacy with Rarity-Aware Protection for LLMs

Madhava Gaikwad

0 citations · 9 references · arXiv

Published on arXiv

2512.17251

Model Inversion Attack

OWASP ML Top 10 — ML03

Sensitive Information Disclosure

OWASP LLM Top 10 — LLM06

Key Finding

A toy simulation confirms that rare categories remain statistically hidden under PAC shielding, while frequent categories are recovered with small estimation error under RAPPOR local DP.

AlignDP

Novel technique introduced


Large language models are exposed to risks of extraction, distillation, and unauthorized fine-tuning. Existing defenses use watermarking or monitoring, but these act only after leakage has occurred. We design AlignDP, a hybrid privacy lock that blocks knowledge transfer at the data interface. The key idea is to separate rare and non-rare fields. Rare fields are shielded by PAC indistinguishability, giving effective zero-epsilon local DP. Non-rare fields are privatized with RAPPOR, giving unbiased frequency estimates under local DP. A global aggregator enforces composition and the privacy budget. This two-tier design hides rare events and adds controlled noise to frequent events. We prove limits on extending PAC protection to global aggregation, give bounds for RAPPOR estimates, and analyze the utility trade-off. A toy simulation confirms feasibility: rare categories remain hidden, while frequent categories are recovered with small error.
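
The two-tier idea described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation, and several parts are assumptions made for illustration: the function names (`rappor_encode`, `privatize`, `estimate_counts`), the category names, the one-hot encoding, and the use of a single randomized-response step with flip parameter `f` (basic one-time RAPPOR, without Bloom filters or instantaneous randomization). In particular, the rare tier is only a stand-in: every rare value is masked before encoding, so all rare values produce the same report distribution. The paper's PAC indistinguishability mechanism may work differently.

```python
import random


def rappor_encode(value, categories, f, rng):
    """One-hot encode `value`, then randomize each bit independently.

    With probability f/2 a bit is forced to 1, with probability f/2 it is
    forced to 0, and with probability 1 - f the true bit is reported.
    """
    report = []
    for category in categories:
        true_bit = 1 if category == value else 0
        u = rng.random()
        if u < f / 2:
            report.append(1)
        elif u < f:
            report.append(0)
        else:
            report.append(true_bit)
    return report


def privatize(value, frequent, f, rng):
    """Route a field through the two tiers (hypothetical stand-in).

    Assumption: a value outside `frequent` counts as rare and is masked to
    None, so its report is a randomized all-zero vector whose distribution
    does not depend on which rare value was observed.
    """
    masked = value if value in frequent else None
    return rappor_encode(masked, frequent, f, rng)


def estimate_counts(reports, categories, f):
    """Unbiased count estimates for the non-rare categories (requires f < 1)."""
    n = len(reports)
    estimates = {}
    for j, category in enumerate(categories):
        ones = sum(report[j] for report in reports)
        # E[ones] = (1 - f) * n_j + (f / 2) * n, solved for n_j.
        estimates[category] = (ones - n * f / 2) / (1 - f)
    return estimates


if __name__ == "__main__":
    rng = random.Random(0)
    frequent = ["chat", "code", "search"]  # hypothetical non-rare categories
    data = (["chat"] * 6000 + ["code"] * 3000 + ["search"] * 900
            + ["rare_a"] * 60 + ["rare_b"] * 40)
    reports = [privatize(v, frequent, 0.5, rng) for v in data]
    for category, est in estimate_counts(reports, frequent, 0.5).items():
        print(category, round(est))
```

In this sketch the aggregator never sees a slot for rare values, so it can only recover frequencies of the non-rare categories, each perturbed by noise whose size is controlled by `f`. Budget accounting across repeated reports, which the abstract assigns to the global aggregator, is not modeled here.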


Key Contributions

  • Rarity-aware two-tier privacy model that applies PAC indistinguishability to rare LLM telemetry events and RAPPOR local DP to non-rare events
  • Proof that PAC protection does not compositionally extend to global aggregation, motivating a hybrid DP budget enforcer
  • Theoretical bounds on RAPPOR frequency estimation error and a toy simulation confirming rare-event concealment with low utility loss on non-rare events
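
As a worked illustration of what a RAPPOR frequency-estimation bound can look like, the derivation below covers the simplest variant only. It is a sketch under stated assumptions, not a reproduction of the paper's bounds: one-hot encoding over the non-rare categories, a single randomized-response step with flip parameter $f \in (0, 1)$, and independent reports from $n$ users. The symbols $b_{ij}$ (reported bit $j$ of user $i$), $x_i$ (true value of user $i$), $c_j$ (number of reports with bit $j$ set), and $n_j$ (true count of category $j$) are introduced here for illustration.

```latex
\begin{align*}
\Pr[b_{ij}=1 \mid x_i = j] &= 1-\tfrac{f}{2},
  \qquad \Pr[b_{ij}=1 \mid x_i \neq j] = \tfrac{f}{2},\\
\mathbb{E}[c_j] &= n_j\left(1-\tfrac{f}{2}\right) + (n-n_j)\,\tfrac{f}{2}
  = (1-f)\,n_j + \tfrac{f}{2}\,n,\\
\hat{n}_j &= \frac{c_j - \tfrac{f}{2}\,n}{1-f},
  \qquad \mathbb{E}[\hat{n}_j] = n_j,\\
\operatorname{Var}[\hat{n}_j] &= \frac{n\,\tfrac{f}{2}\left(1-\tfrac{f}{2}\right)}{(1-f)^2},\\
\Pr\big[\,|\hat{n}_j - n_j| \ge t\,\big] &\le
  \frac{n\,\tfrac{f}{2}\left(1-\tfrac{f}{2}\right)}{(1-f)^2\,t^2}
  \quad \text{for all } t > 0 .
\end{align*}
```

The variance line uses the fact that every reported bit is a Bernoulli variable with success probability $f/2$ or $1 - f/2$, both of which have variance $\tfrac{f}{2}(1-\tfrac{f}{2})$; the last line is Chebyshev's inequality. Under these assumptions the standard deviation of $\hat{n}_j$ grows like $\sqrt{n}$, so the relative error on a category with a fixed share of the population shrinks as $n$ grows, whereas a category with very few occurrences is dominated by noise. For this one-hot variant, two distinct inputs differ in two bits, which gives a per-report local DP parameter of $\varepsilon = 2\ln\frac{2-f}{f}$.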

🛡️ Threat Analysis

Model Inversion Attack

The paper's adversary is an actor who issues repeated queries in order to reconstruct training data from privatized LLM telemetry. AlignDP is designed to block this reconstruction by combining PAC indistinguishability for rare events with RAPPOR-based local DP for non-rare events.


Details

Domains
nlp
Model Types
llm
Threat Tags
black_box, training_time
Applications
llm telemetry privacy, knowledge extraction prevention, model distillation defense