Towards Benchmarking Privacy Vulnerabilities in Selective Forgetting with Large Language Models
Wei Qian, Chenxu Zhao, Yangyi Li, Mengdi Huai
Published on arXiv: 2512.18035
Model Inversion Attack
OWASP ML Top 10 — ML03
Membership Inference Attack
OWASP ML Top 10 — ML04
Sensitive Information Disclosure
OWASP LLM Top 10 — LLM06
Key Finding
Evaluating 21 attack and defense methods across diverse settings reveals that existing machine unlearning techniques consistently introduce exploitable privacy vulnerabilities — including data reconstruction and membership inference — that are inconsistently reported under prior non-standardized evaluations.
PrivUB
Novel technique introduced
The rapid advancements in artificial intelligence (AI) have primarily focused on learning from data to build capable systems. As these systems are increasingly deployed in critical areas, ensuring their privacy and alignment with human values is paramount. Recently, selective forgetting (also known as machine unlearning) has shown promise for privacy and data-removal tasks and has emerged as a transformative paradigm in AI. It refers to a model's ability to selectively erase the influence of previously seen data, which is especially important for compliance with modern data protection regulations and for aligning models with human values. Despite its promise, selective forgetting raises significant privacy concerns, especially when the data involved come from sensitive domains. While new unlearning-induced privacy attacks are continuously proposed, each is shown to outperform its predecessors under different experimental settings, which can lead to overly optimistic and potentially unfair assessments that disproportionately favor one attack over the others. In this work, we present the first comprehensive benchmark for evaluating privacy vulnerabilities in selective forgetting. We extensively investigate the privacy vulnerabilities of machine unlearning techniques and benchmark privacy leakage across a wide range of victim data, state-of-the-art unlearning privacy attacks, unlearning methods, and model architectures. We systematically evaluate and identify critical factors related to unlearning-induced privacy leakage. With our novel insights, we aim to provide a standardized tool for practitioners seeking to deploy customized unlearning applications with faithful privacy assessments.
Key Contributions
- PrivUB: the first comprehensive benchmark for evaluating privacy vulnerabilities introduced by machine unlearning, covering 21 attack and defense methods across multiple victim data types, model architectures, and unlearning algorithms.
- A structured taxonomy of unlearning-induced privacy vulnerabilities spanning data reconstruction attacks (DRAs), membership inference attacks, and fine-tuning-reactivation risks, each grounded in a specific threat model.
- Systematic identification of critical empirical factors driving privacy leakage in selective forgetting, providing practitioners with standardized evaluation protocols for faithful privacy assessments.
🛡️ Threat Analysis
Explicitly evaluates data reconstruction attacks (DRAs) that exploit the discrepancy between pre-trained and unlearned models to recover unlearned training data — an adversary reconstructing private training data from model outputs is a core threat model in PrivUB.
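To make the discrepancy-based threat concrete, here is a minimal toy sketch (not PrivUB's actual method, and the two 1-D "models" are invented for illustration): real DRAs typically run gradient-based optimization in input space, but the core idea is the same, i.e. searching for inputs where the pre-trained and unlearned models disagree most, since that disagreement localizes the forgotten data.

```python
# Toy data reconstruction attack exploiting the pre-trained/unlearned
# model discrepancy. Both models and the data point at x = 3.0 are
# illustrative assumptions, not part of the PrivUB benchmark.

def model_before(x: float) -> float:
    # Pre-trained model: carries a memorized "bump" around the
    # training point at x = 3.0, on top of a linear trend.
    return 1.0 / (1.0 + (x - 3.0) ** 2) + 0.1 * x

def model_after(x: float) -> float:
    # Unlearned model: the bump memorized around x = 3.0 is gone.
    return 0.1 * x

def reconstruct(grid):
    """Return the candidate input where the two models disagree most."""
    return max(grid, key=lambda x: abs(model_before(x) - model_after(x)))

candidates = [i / 10 for i in range(0, 101)]  # scan x in [0, 10]
print(reconstruct(candidates))  # -> 3.0, the unlearned point
```

The attack needs no access to the training set itself: query access to both model versions is enough, which is why publishing an unlearned model alongside (or after) its predecessor widens the attack surface.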
Membership inference attacks on unlearning are a primary attack category benchmarked — adversaries determine whether specific data points were part of the unlearning set by comparing model behaviors before and after unlearning.
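The before/after comparison behind such membership inference attacks can be sketched as follows. This is a simplified loss-threshold variant under assumed numbers, not the benchmarked attacks themselves: if a point's loss rises sharply after unlearning, the adversary infers it was in the forget set.

```python
# Toy membership inference attack on unlearning (illustrative only):
# compare a candidate point's loss under the original and unlearned
# models; a large loss increase flags the point as unlearned.
import math

def nll(prob_true_class: float) -> float:
    """Negative log-likelihood the model assigns to the true label."""
    return -math.log(max(prob_true_class, 1e-12))

def infer_membership(p_before: float, p_after: float,
                     threshold: float = 1.0) -> bool:
    """Flag the point as unlearned if its loss rose by more than `threshold`."""
    return (nll(p_after) - nll(p_before)) > threshold

# A point the model was confident on before unlearning but not after
# is flagged as a member of the forget set; a retained point is not.
print(infer_membership(0.95, 0.20))  # prints True  (loss jump ~1.56)
print(infer_membership(0.90, 0.88))  # prints False (loss jump ~0.02)
```

This illustrates the paradox the paper targets: the act of forgetting a record can itself leak that the record existed, so unlearning must be evaluated against the model pair, not the unlearned model in isolation.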