AlignDP: Hybrid Differential Privacy with Rarity-Aware Protection for LLMs

Large language models are exposed to risks of extraction, distillation, and unauthorized fine-tuning. Existing defenses rely on watermarking or monitoring, but these act only after leakage has occurred. We design AlignDP, a hybrid privacy lock that blocks knowledge transfer at the data interface. The key idea is to separate rare and non-rare fields. Rare fields are shielded by PAC indistinguishability, giving effective zero-epsilon local DP. Non-rare fields are privatized with RAPPOR, giving unbiased frequency estimates under local DP. A global aggregator enforces composition and budget. This two-tier design hides rare events and adds controlled noise to frequent events. We prove limits on extending PAC protection to global aggregation, give bounds for RAPPOR estimates, and analyze the utility trade-off. A toy simulation confirms feasibility: rare categories remain hidden, while frequent categories are recovered with small error.
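The non-rare tier can be illustrated with a minimal sketch of basic one-time RAPPOR over a one-hot encoding. This is not the paper's implementation: the function names, the parameter `f`, and the choice of a one-hot encoding (rather than Bloom filters) are assumptions made for illustration, and the rare-field PAC tier is not modeled here.

```python
import math
import random


def rappor_report(index, k, f, rng):
    """One-hot encode `index` over k categories, then randomize every bit.

    Each bit is forced to 1 with probability f/2, forced to 0 with
    probability f/2, and left truthful with probability 1 - f.
    """
    bits = []
    for i in range(k):
        true_bit = 1 if i == index else 0
        u = rng.random()
        if u < f / 2:
            bits.append(1)
        elif u < f:
            bits.append(0)
        else:
            bits.append(true_bit)
    return bits


def estimate_counts(reports, f):
    """Unbiased per-category count estimates from the randomized reports."""
    n = len(reports)
    k = len(reports[0])
    observed = [sum(r[i] for r in reports) for i in range(k)]
    # E[observed_i] = n_i * (1 - f) + n * f / 2, so invert that relation.
    return [(c - (f / 2) * n) / (1 - f) for c in observed]


def epsilon_one_hot(f):
    """Local DP epsilon of one report: changing the value flips two bits."""
    return 2 * math.log((1 - f / 2) / (f / 2))
```

For example, calling `rappor_report` once per user and passing the collected reports to `estimate_counts` recovers approximate category counts, while `epsilon_one_hot(f)` gives the per-report privacy cost that an aggregator would have to account for. Larger `f` means more noise and a smaller epsilon.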

Key Contributions

Rarity-aware two-tier privacy model that applies PAC indistinguishability to rare LLM telemetry events and RAPPOR local DP to non-rare events
Proof that PAC protection does not compositionally extend to global aggregation, motivating a hybrid DP budget enforcer
Theoretical bounds on RAPPOR frequency estimation error and a toy simulation confirming rare-event concealment with low utility loss on non-rare events
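As a worked illustration of what a bound on RAPPOR frequency estimation error can look like, the following derivation covers basic one-time RAPPOR on a one-hot encoding. It is a standard textbook-style calculation under stated assumptions, not necessarily the bound proved in the paper. The symbols are assumptions introduced here: $N$ reports, true count $n_i$ for category $i$, observed count $c_i$ of reports with bit $i$ set, and randomization parameter $f \in (0, 1)$, where each bit is forced to 1 or to 0 with probability $f/2$ each and kept truthful otherwise.

```latex
\begin{align*}
\mathbb{E}[c_i] &= n_i\left(1 - \tfrac{f}{2}\right) + (N - n_i)\,\tfrac{f}{2}
                 = n_i (1 - f) + \tfrac{f}{2} N, \\
\hat{n}_i &= \frac{c_i - \tfrac{f}{2} N}{1 - f},
  \qquad \mathbb{E}[\hat{n}_i] = n_i, \\
\operatorname{Var}(\hat{n}_i) &= \frac{N \cdot \tfrac{f}{2}\left(1 - \tfrac{f}{2}\right)}{(1 - f)^2}
                               = \frac{N f (2 - f)}{4 (1 - f)^2}, \\
\Pr\left[\,|\hat{n}_i - n_i| \ge t\,\right] &\le \frac{N f (2 - f)}{4 (1 - f)^2\, t^2}
  \quad \text{(Chebyshev)}.
\end{align*}
```

The variance is the same for every category because a reported bit is Bernoulli with parameter either $f/2$ or $1 - f/2$, and both have variance $\tfrac{f}{2}(1 - \tfrac{f}{2})$. The error therefore grows like $\sqrt{N}$ in absolute counts, so relative error on frequent categories shrinks as more reports arrive.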

🛡️ Threat Analysis

Model Inversion Attack

The paper's adversary is explicitly modeled as an actor that issues repeated queries to reconstruct training data from privatized LLM telemetry. AlignDP is designed to block this reconstruction by combining PAC indistinguishability for rare events with RAPPOR-based local DP for non-rare events.
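Because the adversary issues repeated queries, a global budget check is what limits cumulative leakage. A minimal sketch of such an enforcer follows, assuming basic sequential composition, where the epsilons of answered queries simply add up. The class name `BudgetAccountant` and its methods are hypothetical and are not taken from the paper.

```python
class BudgetAccountant:
    """Hypothetical global budget enforcer using basic sequential composition."""

    def __init__(self, total_epsilon):
        if total_epsilon < 0:
            raise ValueError("total_epsilon must be non-negative")
        self.total_epsilon = total_epsilon
        self.spent = 0.0

    def try_spend(self, epsilon):
        """Charge `epsilon` if it fits in the remaining budget.

        Returns True when the query may be answered, and False when
        answering it would exceed the total budget (nothing is charged).
        """
        if epsilon < 0:
            raise ValueError("epsilon must be non-negative")
        # Small tolerance so floating-point rounding does not reject a
        # query that exactly exhausts the budget.
        if self.spent + epsilon > self.total_epsilon + 1e-12:
            return False
        self.spent += epsilon
        return True

    @property
    def remaining(self):
        """Budget still available, never reported as negative."""
        return max(0.0, self.total_epsilon - self.spent)
```

With a total budget of 1.0, two queries costing 0.4 each are accepted and a third is refused, which leaves 0.2 unspent. Tighter accounting, such as advanced composition, would allow more queries under the same nominal budget, but that is outside this sketch.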

Details

Domains

nlp

Model Types

llm

Threat Tags

black_box

training_time

Applications

2026 0 cit.
