UnlearnShield: Shielding Forgotten Privacy against Unlearning Inversion
Lulu Xue 1, Shengshan Hu 1, Wei Lu 1, Ziqi Zhou 1, Yufei Song 1, Jianhong Cheng 1,2, Minghui Li 1, Yanjun Zhang 3, Leo Yu Zhang 4
1 Huazhong University of Science and Technology
2 Institute of Guizhou Aerospace Measuring and Testing Technology
Published on arXiv (arXiv:2601.20325)
Model Inversion Attack
OWASP ML Top 10 — ML03
Key Finding
Achieves a favorable trade-off among privacy protection against inversion attacks, model accuracy, and forgetting efficacy.
UnlearnShield
Novel technique introduced
Machine unlearning is an emerging technique that aims to remove the influence of specific data from trained models, thereby enhancing privacy protection. However, recent research has uncovered critical privacy vulnerabilities, showing that adversaries can exploit unlearning inversion to reconstruct data that was intended to be erased. Despite the severity of this threat, dedicated defenses remain lacking. To address this gap, we propose UnlearnShield, the first defense specifically tailored to counter unlearning inversion. UnlearnShield introduces directional perturbations in the cosine representation space and regulates them through a constraint module to jointly preserve model accuracy and forgetting efficacy, thereby reducing inversion risk while maintaining utility. Experiments demonstrate that UnlearnShield achieves a favorable trade-off among privacy protection, model accuracy, and forgetting efficacy.
Key Contributions
- First defense specifically tailored to counter unlearning inversion attacks on machine-unlearned models
- Directional perturbations in cosine representation space that reduce reconstruction risk without degrading model accuracy or forgetting efficacy
- Constraint module that jointly balances privacy protection, model utility, and forgetting quality
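The constraint module's role, bounding how much distortion the defense may add, can be pictured with a standard primitive: projecting a perturbation onto an L2 ball. This is only an illustrative sketch; `project_to_ball` is a hypothetical helper, and the paper's actual constraint module jointly weighs accuracy and forgetting terms that this sketch omits.

```python
import numpy as np

def project_to_ball(delta: np.ndarray, radius: float) -> np.ndarray:
    """Project a perturbation onto an L2 ball of the given radius.

    If the perturbation already fits inside the ball it is returned
    unchanged; otherwise it is rescaled to sit on the boundary, capping
    the distortion the defense introduces.
    """
    n = np.linalg.norm(delta)
    if n <= radius:
        return delta
    return delta * (radius / n)

d = np.array([3.0, 4.0])           # norm 5.0
clipped = project_to_ball(d, 2.5)  # rescaled to norm 2.5
```

Bounding the perturbation norm is one simple proxy for "preserve model utility": small representation shifts tend to leave predictions intact while still moving the features an inversion attack would exploit.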
🛡️ Threat Analysis
The threat model is an adversary exploiting unlearning inversion to reconstruct training data that was intentionally forgotten — a direct data reconstruction attack. UnlearnShield defends against this by introducing directional perturbations in the cosine representation space to reduce inversion risk while preserving model utility and forgetting efficacy.
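A directional perturbation in cosine space can be sketched as rotating a feature vector so that its cosine similarity to the original representation drops to a chosen target while its norm is preserved (a crude proxy for utility). This is a hedged illustration under those assumptions, not the paper's algorithm; `directional_perturbation` and `target_cos` are invented names for this sketch.

```python
import numpy as np

def directional_perturbation(feat: np.ndarray, target_cos: float,
                             seed: int = 0, eps: float = 1e-8) -> np.ndarray:
    """Return a vector whose cosine similarity to `feat` equals
    `target_cos` and whose L2 norm matches `feat`'s.

    A random direction is orthogonalized against `feat`, then blended
    with `feat`'s unit vector so the result lands at the target angle.
    """
    rng = np.random.default_rng(seed)
    r = rng.standard_normal(feat.shape)
    # Remove the component of r along feat, then normalize: r ⟂ feat.
    r -= r.dot(feat) / (feat.dot(feat) + eps) * feat
    r /= np.linalg.norm(r) + eps
    u = feat / (np.linalg.norm(feat) + eps)
    # cos(angle(feat, new_dir)) = target_cos since u and r are orthonormal.
    new_dir = target_cos * u + np.sqrt(1.0 - target_cos ** 2) * r
    return new_dir * np.linalg.norm(feat)

f = np.array([1.0, 2.0, 3.0])
g = directional_perturbation(f, target_cos=0.5)
cos = f.dot(g) / (np.linalg.norm(f) * np.linalg.norm(g))  # ≈ 0.5
```

Lowering the cosine alignment between the unlearned model's representations and the original ones is one way to deny an inversion adversary the directional signal it needs for reconstruction, which is the intuition the sketch captures.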