
Data-Free Privacy-Preserving for LLMs via Model Inversion and Selective Unlearning

Xinjie Zhou 1, Zhihui Yang 1, Lechao Cheng 2, Sai Wu 1, Gang Chen 1

0 citations · 25 references · arXiv


Published on arXiv

2601.15595

Model Inversion Attack

OWASP ML Top 10 — ML03

Sensitive Information Disclosure

OWASP LLM Top 10 — LLM06

Key Finding

DFSU effectively removes target PII from Pythia models while maintaining model utility, without requiring access to the original training data.

DFSU (Data-Free Selective Unlearning)

Novel technique introduced


Large language models (LLMs) exhibit powerful capabilities but risk memorizing sensitive personally identifiable information (PII) from their training data, posing significant privacy concerns. While machine unlearning techniques aim to remove such data, they predominantly depend on access to the training data. This requirement is often impractical, as training data in real-world deployments is commonly proprietary or inaccessible. To address this limitation, we propose Data-Free Selective Unlearning (DFSU), a novel privacy-preserving framework that removes sensitive PII from an LLM without requiring its training data. Our approach first synthesizes pseudo-PII through language model inversion, then constructs token-level privacy masks for these synthetic samples, and finally performs token-level selective unlearning via a contrastive mask loss within a low-rank adaptation (LoRA) subspace. Extensive experiments on the AI4Privacy PII-Masking dataset using Pythia models demonstrate that our method effectively removes target PII while maintaining model utility.
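The third step of the pipeline — token-level selective unlearning via a contrastive mask loss — can be illustrated with a simplified sketch. The exact loss formulation is not given in this summary, so the function below is an assumption: it raises the negative log-likelihood on masked (PII) tokens while anchoring the log-likelihood of unmasked tokens, which is the general shape of a token-level forget/retain objective. The function name, the `lam` weighting, and the sign conventions are all illustrative.

```python
def contrastive_mask_loss(token_logprobs, pii_mask, lam=1.0):
    """Illustrative token-level contrastive mask loss (sketch; not the
    paper's exact formulation). Tokens with pii_mask == 1 are "forget"
    targets; the rest are "retain" anchors.

    Minimizing this loss pushes PII-token log-probabilities down
    (forgetting) while keeping non-PII log-probabilities high (utility).
    """
    forget = [lp for lp, m in zip(token_logprobs, pii_mask) if m]
    retain = [lp for lp, m in zip(token_logprobs, pii_mask) if not m]
    # Forget term: minimizing the loss minimizes PII log-probability.
    forget_term = sum(forget) / max(len(forget), 1)
    # Retain term: minimizing the loss keeps retain log-probability high.
    retain_term = -sum(retain) / max(len(retain), 1)
    return forget_term + lam * retain_term
```

In practice such a loss would be applied to per-token log-probabilities produced by the model, with gradients flowing only through LoRA adapter parameters so the base weights stay frozen.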


Key Contributions

  • Data-free selective unlearning (DFSU) framework that removes PII from LLMs without requiring access to original training data
  • Language model inversion to synthesize pseudo-PII tokens representing what the model has memorized
  • Token-level contrastive mask loss within a LoRA subspace for targeted selective unlearning
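The second contribution, token-level privacy masks over the synthesized samples, amounts to marking which token positions belong to pseudo-PII spans. A minimal sketch is below; the helper name `build_privacy_mask` and the `(start, end)` span representation are hypothetical, since the summary does not specify the paper's mask construction.

```python
def build_privacy_mask(tokens, pii_spans):
    """Build a binary mask over token positions (illustrative sketch).

    tokens    : list of token strings for one synthesized sample
    pii_spans : list of (start, end) token-index pairs, end exclusive,
                marking where pseudo-PII was detected
    Returns a list of 0/1 flags, 1 where the token is part of PII.
    """
    mask = [0] * len(tokens)
    for start, end in pii_spans:
        for i in range(max(start, 0), min(end, len(tokens))):
            mask[i] = 1
    return mask
```

For example, masking a synthesized name in `["My", "name", "is", "John", "Smith", "."]` with the span `(3, 5)` flags only the two name tokens, which is what lets the unlearning loss target PII tokens without penalizing the surrounding text.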

🛡️ Threat Analysis

Model Inversion Attack

Model inversion is the core technical contribution: the framework synthesizes pseudo-PII by inverting the LLM to recover what sensitive training data it has memorized, without access to the original training corpus. In effect, this repurposes a model inversion attack as a defensive probe: the defender runs the attack against their own model to discover what must be unlearned.
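The defensive probe described above can be sketched as a sampling loop: prompt the model with PII-eliciting templates, and keep completions the model assigns unusually high likelihood, since high-confidence outputs are the ones most likely to be memorized. The `generate`/`score` callables, the sampling count, and the log-likelihood threshold are all assumptions for illustration, not the paper's interface.

```python
def invert_for_pseudo_pii(generate, score, prompts, n_samples=8, threshold=-2.0):
    """Defensive model-inversion probe (illustrative sketch).

    generate(prompt) -> str   : sample a completion from the target LLM
    score(prompt, completion) : average per-token log-likelihood the
                                model assigns its own completion
    Completions scored above `threshold` are treated as pseudo-PII
    candidates, on the heuristic that memorized strings are regenerated
    with unusually high likelihood.
    """
    candidates = []
    for prompt in prompts:
        for _ in range(n_samples):
            completion = generate(prompt)
            if score(prompt, completion) > threshold:
                candidates.append((prompt, completion))
    return candidates
```

The resulting candidates stand in for the inaccessible training data: they are what the mask-construction and unlearning steps operate on.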


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
training_time, white_box
Datasets
AI4Privacy PII-Masking dataset
Applications
large language models, pii removal, privacy-preserving nlp