Towards Benchmarking Privacy Vulnerabilities in Selective Forgetting with Large Language Models

Wei Qian , Chenxu Zhao , Yangyi Li , Mengdi Huai

1 citation · 55 references · arXiv

Published on arXiv · 2512.18035

Model Inversion Attack

OWASP ML Top 10 — ML03

Membership Inference Attack

OWASP ML Top 10 — ML04

Sensitive Information Disclosure

OWASP LLM Top 10 — LLM06

Key Finding

Evaluating 21 attack and defense methods across diverse settings reveals that existing machine unlearning techniques consistently introduce exploitable privacy vulnerabilities — including data reconstruction and membership inference — that are inconsistently reported under prior non-standardized evaluations.

PrivUB

Novel technique introduced


The rapid advancements in artificial intelligence (AI) have primarily focused on learning from data to build knowledgeable systems. As these systems are increasingly deployed in critical areas, ensuring their privacy and alignment with human values is paramount. Recently, selective forgetting (also known as machine unlearning) has shown promise for privacy and data removal tasks and has emerged as a transformative paradigm in AI. It refers to a model's ability to selectively erase the influence of previously seen data, which is especially important for compliance with modern data protection regulations and for aligning models with human values. Despite its promise, selective forgetting raises significant privacy concerns, especially when the data involved come from sensitive domains. While new unlearning-induced privacy attacks are continuously proposed, each is shown to outperform its predecessors under different experimental settings, which can lead to overly optimistic and potentially unfair assessments that disproportionately favor one attack over others. In this work, we present the first comprehensive benchmark for evaluating privacy vulnerabilities in selective forgetting. We extensively investigate the privacy vulnerabilities of machine unlearning techniques and benchmark privacy leakage across a wide range of victim data, state-of-the-art unlearning privacy attacks, unlearning methods, and model architectures. We systematically evaluate and identify critical factors related to unlearning-induced privacy leakage. With our novel insights, we aim to provide a standardized tool for practitioners seeking to deploy customized unlearning applications with faithful privacy assessments.


Key Contributions

  • PrivUB: the first comprehensive benchmark for evaluating privacy vulnerabilities introduced by machine unlearning, covering 21 attack and defense methods across multiple victim data types, model architectures, and unlearning algorithms.
  • A structured taxonomy of unlearning-induced privacy vulnerabilities spanning data reconstruction attacks (DRAs), membership inference attacks, and fine-tuning-reactivation risks, each grounded in a specific threat model.
  • Systematic identification of critical empirical factors driving privacy leakage in selective forgetting, providing practitioners with standardized evaluation protocols for faithful privacy assessments.

🛡️ Threat Analysis

Model Inversion Attack

Explicitly evaluates data reconstruction attacks (DRAs) that exploit the discrepancy between pre-trained and unlearned models to recover unlearned training data — an adversary reconstructing private training data from model outputs is a core threat model in PrivUB.
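The discrepancy-exploiting idea can be illustrated with a minimal linear-model sketch. This is our own hypothetical toy, not the PrivUB implementation: when a single point is exactly unlearned by retraining on the retain set, the parameter difference between the original and unlearned models points, up to scale, in the direction of the forgotten input, so the adversary can reconstruct that direction from the two models alone.

```python
import numpy as np

# Hypothetical toy sketch of a data-reconstruction attack (our illustration,
# not the paper's code): for a linear model, the parameter discrepancy
# between the original and the unlearned model leaks the direction of the
# forgotten training point.

rng = np.random.default_rng(0)
d, n = 8, 200

w_true = rng.normal(size=d)
X_retain = rng.normal(size=(n, d))
y_retain = X_retain @ w_true + 0.1 * rng.normal(size=n)

# One forgotten point with a memorized, off-distribution label.
x_forget = rng.normal(size=d)
y_forget = x_forget @ w_true + 5.0

X_all = np.vstack([X_retain, x_forget])
y_all = np.append(y_retain, y_forget)

# "Original" model trained on everything; "unlearned" model retrained on
# the retain set only (exact unlearning by retraining).
w_orig, *_ = np.linalg.lstsq(X_all, y_all, rcond=None)
w_unl, *_ = np.linalg.lstsq(X_retain, y_retain, rcond=None)

# Attack: the model discrepancy is, up to scale, an estimate of the
# forgotten input's direction.
delta = w_orig - w_unl
x_hat = delta / np.linalg.norm(delta)

cos_sim = float(x_hat @ x_forget / np.linalg.norm(x_forget))
print(f"cosine similarity to forgotten point: {cos_sim:.2f}")
```

The toy works because, by the Sherman–Morrison update for least squares, removing one example shifts the weights along the whitened direction of that example; real DRAs on deep models replace this closed form with optimization over inputs, but exploit the same before/after discrepancy.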

Membership Inference Attack

Membership inference attacks on unlearning are a primary attack category benchmarked — adversaries determine whether specific data points were part of the unlearning set by comparing model behaviors before and after unlearning.
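The before/after comparison can be sketched in a few lines, assuming a linear model and exact unlearning by retraining (a toy example of ours, not the paper's code): the adversary scores each candidate point by how much its loss increases from the original model to the unlearned model, since points in the forget set tend to show the largest increase.

```python
import numpy as np

# Hypothetical toy sketch of a membership-inference attack on unlearning
# (our illustration, not the PrivUB implementation). Score = loss increase
# after unlearning; large increases suggest membership in the forget set.

rng = np.random.default_rng(0)
d, n_retain, n_forget = 3, 50, 10

def features(X):
    # Linear model with an intercept column.
    return np.hstack([X, np.ones((len(X), 1))])

w_true = rng.normal(size=d)
X_retain = rng.normal(size=(n_retain, d))
X_forget = rng.normal(size=(n_forget, d))
y_retain = X_retain @ w_true + 0.1 * rng.normal(size=n_retain)
# Forget-set labels carry a systematic offset the original model memorizes.
y_forget = X_forget @ w_true + 3.0 + 0.1 * rng.normal(size=n_forget)

X_all = np.vstack([X_retain, X_forget])
y_all = np.concatenate([y_retain, y_forget])

# "Original" model trained on everything; "unlearned" model retrained on
# the retain set only (exact unlearning by retraining).
w_orig, *_ = np.linalg.lstsq(features(X_all), y_all, rcond=None)
w_unl, *_ = np.linalg.lstsq(features(X_retain), y_retain, rcond=None)

def per_example_loss(w, X, y):
    return (features(X) @ w - y) ** 2

# Attack score: per-example loss increase after unlearning.
score_forget = (per_example_loss(w_unl, X_forget, y_forget)
                - per_example_loss(w_orig, X_forget, y_forget))
score_retain = (per_example_loss(w_unl, X_retain, y_retain)
                - per_example_loss(w_orig, X_retain, y_retain))

threshold = 1.0  # hand-picked for this toy
preds = np.concatenate([score_retain, score_forget]) > threshold
truth = np.concatenate([np.zeros(n_retain, bool), np.ones(n_forget, bool)])
accuracy = (preds == truth).mean()
print(f"attack accuracy: {accuracy:.2f}")
```

Real attacks on LLMs replace the squared-error loss with token-level likelihoods and calibrate the threshold with shadow models, but the signal exploited is the same behavioral gap between the two model versions.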


Details

Domains
nlp
Model Types
llm · transformer
Threat Tags
black_box · grey_box · white_box · training_time · inference_time
Applications
machine unlearning · large language models · privacy-preserving ai