
Teleportation-Based Defenses for Privacy in Approximate Machine Unlearning

Mohammad M Maheri¹, Xavier Cadet², Peter Chin², Hamed Haddadi¹

0 citations · 83 references

Published on arXiv: 2512.00272

  • Membership Inference Attack — OWASP ML Top 10: ML04
  • Model Inversion Attack — OWASP ML Top 10: ML03

Key Finding

WARP reduces adversarial advantage (AUC) by up to 64% in black-box and 92% in white-box settings across six unlearning algorithms while maintaining retain-set accuracy.

WARP

Novel technique introduced


Abstract

Approximate machine unlearning aims to efficiently remove the influence of specific data points from a trained model, offering a practical alternative to full retraining. However, it introduces privacy risks: an adversary with access to pre- and post-unlearning models can exploit their differences for membership inference or data reconstruction. We show these vulnerabilities arise from two factors: large gradient norms of forget-set samples and the close proximity of unlearned parameters to the original model. To demonstrate their severity, we propose unlearning-specific membership inference and reconstruction attacks, showing that several state-of-the-art methods (e.g., NGP, SCRUB) remain vulnerable. To mitigate this leakage, we introduce WARP, a plug-and-play teleportation defense that leverages neural network symmetries to reduce forget-set gradient energy and increase parameter dispersion while preserving predictions. This reparameterization obfuscates the signal of forgotten data, making it harder for attackers to distinguish forgotten samples from non-members or recover them via reconstruction. Across six unlearning algorithms, our approach achieves consistent privacy gains, reducing adversarial advantage (AUC) by up to 64% in black-box and 92% in white-box settings, while maintaining accuracy on retained data. These results highlight teleportation as a general tool for reducing attack success in approximate unlearning.


Key Contributions

  • Identifies two root causes of privacy leakage in approximate unlearning: large forget-set gradient norms and close proximity of unlearned parameters to the original model.
  • Proposes novel unlearning-specific membership inference and data reconstruction attacks, showing SOTA methods (NGP, SCRUB) remain vulnerable.
  • Introduces WARP, a plug-and-play teleportation defense using neural network symmetries to reduce forget-set gradient energy and increase parameter dispersion while preserving predictions.
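The contributions above hinge on "teleportation": moving a network through its parameter-space symmetries so the weights change but the function does not. A minimal sketch using the positive-rescaling symmetry of ReLU networks, since relu(a·z) = a·relu(z) for a > 0 (an illustration only; WARP's actual transformation and objective are defined in the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# A tiny 2-layer ReLU network: f(x) = W2 @ relu(W1 @ x)
W1 = rng.normal(size=(8, 4))
W2 = rng.normal(size=(3, 8))

def forward(W1, W2, x):
    return W2 @ np.maximum(W1 @ x, 0.0)

x = rng.normal(size=4)
y_before = forward(W1, W2, x)

# "Teleport": scale each hidden unit i by alpha_i > 0 in W1 and by
# 1/alpha_i in W2. Because relu(a*z) = a*relu(z) for a > 0, the
# network's output is unchanged while its parameters move.
alpha = rng.uniform(0.5, 2.0, size=8)
W1_t = alpha[:, None] * W1
W2_t = W2 / alpha[None, :]

y_after = forward(W1_t, W2_t, x)
assert np.allclose(y_before, y_after)   # identical predictions

# ...yet the weights (and hence per-sample gradients w.r.t. them)
# are dispersed away from the original parameters:
print(np.linalg.norm(W1_t - W1) > 0)  # True
```

Moving the unlearned model away from the original in weight space while preserving its outputs is exactly the lever the paper uses to break the pre-/post-model comparison that the attacks rely on.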

🛡️ Threat Analysis

Model Inversion Attack

Paper proposes data reconstruction attacks (DRA) in the unlearning context where an adversary recovers training data by exploiting differences between pre- and post-unlearning model parameters; the WARP defense directly addresses this by obfuscating forget-set gradient signals.
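Why the parameter difference leaks the forgotten input can be seen on a toy linear model: one gradient-ascent "unlearning" step leaves a weight delta proportional to the forget sample itself. This is a deliberately simplified sketch of the root cause (naive ascent-based unlearning assumed), not the paper's DRA:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 16
w_pre = rng.normal(size=d)

# Forget sample the adversary wants to reconstruct.
x_f = rng.normal(size=d)
y_f = 1.0

# Naive "unlearning": one gradient-ascent step on the forget sample's
# squared loss 0.5*(w @ x - y)^2, whose gradient is (w @ x - y) * x.
eta = 0.1
residual = w_pre @ x_f - y_f
w_post = w_pre + eta * residual * x_f

# White-box adversary sees both models; the parameter difference is
# proportional to x_f, so its direction recovers the forgotten input.
delta = w_post - w_pre
x_hat = delta / np.linalg.norm(delta)

cos = abs(x_hat @ x_f) / np.linalg.norm(x_f)
print(round(cos, 3))  # 1.0: perfect directional reconstruction
```

Deep networks are messier, but the same two ingredients the paper identifies (large forget-set gradient norm, unlearned weights close to the original) keep the delta informative; WARP attacks both ingredients at once.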

Membership Inference Attack

Paper proposes unlearning-specific membership inference attacks where an adversary with access to pre- and post-unlearning models determines whether a specific sample was in the forget set; WARP defends against this by reducing adversarial AUC by up to 64% (black-box) and 92% (white-box).
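The loss-shift signal behind such an attack can be sketched on a toy linear regression: samples in the forget set see their loss rise between the pre- and post-unlearning models, while held-out non-members mostly do not. A simplified illustration under an assumed naive ascent-based unlearning step, not the paper's attack implementation:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 50, 8
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)

# "Pre-unlearning" model: least-squares fit on all n samples.
w_pre, *_ = np.linalg.lstsq(X, y, rcond=None)

# Naive unlearning of sample 0: one gradient-ascent step on its loss.
x_f, y_f = X[0], y[0]
eta = 0.5
w_post = w_pre + eta * (w_pre @ x_f - y_f) * x_f

def loss(w, x, t):
    return 0.5 * (w @ x - t) ** 2

# Attack score: per-sample loss shift between the two models. Ascending
# the forget sample's loss necessarily raises it, so its score is positive.
score_member = loss(w_post, x_f, y_f) - loss(w_pre, x_f, y_f)

# Fresh non-members are affected only through incidental overlap with x_f.
X_out = rng.normal(size=(200, d))
y_out = X_out @ w_true + 0.1 * rng.normal(size=200)
scores_out = [loss(w_post, xo, yo) - loss(w_pre, xo, yo)
              for xo, yo in zip(X_out, y_out)]

print(f"member score: {score_member:.4f}, "
      f"median non-member score: {np.median(scores_out):.4f}")
```

Thresholding this shift separates forget-set members from non-members; reducing that separation (measured as adversarial AUC) is precisely what the reported 64%/92% figures quantify.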


Details

Domains: vision
Model Types: CNN, Transformer
Threat Tags: white_box, black_box, training_time
Datasets: CIFAR-10, ImageNet
Applications: image classification, machine unlearning