defense 2025

Toward Reliable Machine Unlearning: Theory, Algorithms, and Evaluation

Ali Ebrahimpour-Boroojeny



Published on arXiv: 2512.06993

Membership Inference Attack

OWASP ML Top 10 — ML04

Key Finding

TRW reduces the membership-inference gap to a retrained model by 46% under the proposed MIA-NN metric and 19% under U-LiRA relative to SOTA baselines on CIFAR-10.

AMUN / MIA-NN / TRW

Novel techniques introduced


We propose new methodologies for both unlearning a random set of samples and class unlearning, and show that they outperform existing methods. The main driver of our unlearning methods is the similarity of predictions to a retrained model on both the forget and remain samples. We introduce Adversarial Machine UNlearning (AMUN), which surpasses prior state-of-the-art methods for image classification as measured by state-of-the-art MIA scores. AMUN lowers the model's confidence on forget samples by fine-tuning on their corresponding adversarial examples. Through theoretical analysis, we identify factors governing AMUN's performance, including model smoothness. To facilitate training of smooth models with a controlled Lipschitz constant, we propose FastClip, a scalable method that performs layer-wise spectral-norm clipping of affine layers. In a separate study, we show that increased smoothness naturally improves the transferability of adversarial examples, supporting the second of these factors. Following the same principles for class unlearning, we introduce a nearest-neighbor membership inference attack (MIA-NN) that uses the probabilities assigned to neighboring classes to detect unlearned samples, and use it to show that existing methods fail to replicate a retrained model's behavior. We then propose a fine-tuning objective that mitigates this leakage by approximating, for forget-class inputs, the distribution over remaining classes that a model retrained from scratch would produce. To construct this approximation, we estimate inter-class similarity and tilt the target model's distribution accordingly. The resulting Tilted ReWeighting (TRW) distribution serves as the target during fine-tuning. Across multiple benchmarks, TRW matches or surpasses existing unlearning methods on prior metrics.
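The core AMUN idea from the abstract can be sketched in a few lines. This is only one plausible reading of the method, not the thesis's implementation: it uses a tiny numpy softmax classifier in place of a deep image model, a single-step FGSM attack in place of a stronger attack, and assumes (an assumption of this sketch) that the adversarial examples are relabeled with the model's own prediction on them, so fine-tuning lowers confidence on the original forget labels.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

class LinearClassifier:
    """Tiny softmax classifier with manual gradients, standing in for the
    image model (the thesis uses deep networks; this is only a sketch)."""
    def __init__(self, dim, n_classes, seed=0):
        rng = np.random.default_rng(seed)
        self.W = 0.01 * rng.standard_normal((dim, n_classes))

    def probs(self, X):
        return softmax(X @ self.W)

    def input_grad(self, X, y):
        # Gradient of the cross-entropy loss w.r.t. the inputs X.
        P = self.probs(X)
        P[np.arange(len(y)), y] -= 1.0
        return P @ self.W.T

    def weight_grad(self, X, y):
        # Gradient of the cross-entropy loss w.r.t. the weights W.
        P = self.probs(X)
        P[np.arange(len(y)), y] -= 1.0
        return X.T @ P / len(y)

def fgsm(model, X, y, eps=0.1):
    """One-step adversarial example (FGSM), standing in for the stronger
    attacks the thesis would use."""
    return np.clip(X + eps * np.sign(model.input_grad(X, y)), 0.0, 1.0)

def amun_step(model, forget_X, forget_y, lr=0.5):
    """One AMUN-style update (a sketch of the idea, not the thesis code):
    generate adversarial examples of the forget samples, relabel them with
    the model's own prediction, and fine-tune on them."""
    X_adv = fgsm(model, forget_X, forget_y)
    adv_labels = model.probs(X_adv).argmax(axis=1)
    model.W -= lr * model.weight_grad(X_adv, adv_labels)
```

In practice the fine-tuning would also include the remain set to preserve utility; the sketch shows only the forget-sample update that drives the unlearning.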


Key Contributions

  • AMUN (Adversarial Machine UNlearning): fine-tunes on adversarial examples of forget-set samples to reduce model confidence while preserving utility, reducing MIA success to near-random guessing on CIFAR-10.
  • MIA-NN: a nearest-neighbor membership inference attack exploiting neighboring-class probability distributions that exposes failures of existing class unlearning methods.
  • TRW (Tilted ReWeighting): a fine-tuning objective approximating the retrained model's remaining-class distribution, reducing the gap to retrained models by 19% (U-LiRA) and 46% (MIA-NN) relative to SOTA on CIFAR-10.
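The MIA-NN attack described above can be illustrated with a minimal sketch. It assumes (this is the sketch's assumption, not necessarily the thesis's construction) that the attacker knows which classes are most similar to the forgotten class, e.g. estimated from a confusion matrix, and simply thresholds the probability mass the model assigns to those neighbors.

```python
import numpy as np

def mia_nn_score(probs, neighbor_classes):
    """MIA-NN-style score (a sketch): after approximate class unlearning,
    a model tends to dump the forgotten class's probability mass onto that
    class's nearest-neighbor classes, whereas a retrained model does not.
    High mass on those neighbors therefore flags a sample as unlearned
    rather than never seen.
    probs: (n_samples, n_classes) softmax outputs.
    neighbor_classes: indices of the classes most similar to the forgotten
    class (assumed known to the attacker in this sketch)."""
    return np.asarray(probs)[:, neighbor_classes].sum(axis=1)

def mia_nn_attack(probs, neighbor_classes, threshold=0.5):
    """Predict forget-set membership by thresholding the neighbor-mass score."""
    return mia_nn_score(probs, neighbor_classes) > threshold
```

A defense succeeds against this attack exactly when its neighbor-mass scores on forget samples are indistinguishable from a retrained model's, which is the behavior TRW targets.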

🛡️ Threat Analysis

Membership Inference Attack

The paper's core contributions are evaluated through membership inference attack (MIA) scores: MIA-NN is a novel MIA that detects unlearned samples via neighboring-class probabilities, exposing vulnerabilities in existing class unlearning methods; AMUN and TRW are defenses that reduce MIA success to near-random chance.
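The TRW target distribution can be sketched as an exponential tilting of the model's output. The exact tilting function in the thesis may differ; this sketch assumes a per-class similarity vector to the forgotten class and a temperature parameter, both introduced here for illustration.

```python
import numpy as np

def tilted_reweighting(probs, forget_class, class_sim, temperature=1.0):
    """TRW-style target distribution (a sketch). For a forget-class input,
    approximate what a retrained model would predict over the *remaining*
    classes: tilt the current model's output by estimated inter-class
    similarity (class_sim[c] = similarity of class c to the forgotten
    class), zero out the forgotten class, and renormalize.
    The result serves as the target distribution during fine-tuning."""
    probs = np.asarray(probs, dtype=float)
    tilted = probs * np.exp(np.asarray(class_sim, dtype=float) / temperature)
    tilted[forget_class] = 0.0  # all mass goes to the remaining classes
    return tilted / tilted.sum()
```

The tilting upweights remaining classes similar to the forgotten one, mimicking a retrained model's tendency to route forget-class inputs to visually similar classes rather than uniformly.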


Details

Domains
vision
Model Types
cnn, transformer
Threat Tags
training_time, inference_time, black_box
Datasets
CIFAR-10
Applications
image classification, machine unlearning