arXiv · Jan 30, 2026
Manyi Li, Yufan Liu, Lai Jiang et al. · University of the Chinese Academy of Sciences · Chinese Academy of Sciences · Beijing University of Aeronautics and Astronautics +1 more
Attacks machine unlearning defenses in diffusion models by optimizing initial latent variables to reactivate erased NSFW knowledge
Input Manipulation Attack · vision · generative
Although unlearning-based defenses claim to purge Not-Safe-For-Work (NSFW) concepts from diffusion models (DMs), we reveal that this "forgetting" is largely an illusion. Unlearning only partially disrupts the mapping between linguistic symbols and the underlying knowledge, which remains intact as dormant memories. We find that the distributional discrepancy in the denoising process serves as a measurable indicator of how much of the mapping is retained, and thus of the strength of unlearning. Inspired by this, we propose IVO (Initial Latent Variable Optimization), a concise and powerful attack framework that reactivates these dormant memories by reconstructing the broken mappings. Through Image Inversion, Adversarial Optimization, and Reused Attack, IVO optimizes initial latent variables to realign the noise distribution of unlearned models with their original unsafe states. Extensive experiments across 8 widely used unlearning techniques demonstrate that IVO achieves superior attack success rates and strong semantic consistency, exposing fundamental flaws in current defenses. The code is available at anonymous.4open.science/r/IVO/. Warning: This paper contains unsafe images that may offend some readers.
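The core idea — optimizing the initial latent so the unlearned model's denoising output realigns with the original model's — can be sketched with toy linear stand-ins for the two noise predictors. Everything below (the linear models, the anchor term, all names) is an illustrative assumption, not the paper's implementation:

```python
import numpy as np

# Toy stand-ins for the two noise predictors (illustrative only; in the
# real attack these would be the original and unlearned diffusion models'
# epsilon-predictions at a given denoising step).
rng = np.random.default_rng(0)
W_orig = rng.standard_normal((4, 4))                 # "original" model
W_unl = W_orig + 0.5 * rng.standard_normal((4, 4))   # unlearning perturbs the mapping

def discrepancy(z):
    """Gap between the unlearned and original noise predictions for latent z."""
    return np.linalg.norm(W_unl @ z - W_orig @ z)

def optimize_latent(z0, steps=300, lr=0.05, lam=0.1):
    """Gradient-descend the initial latent to shrink the discrepancy,
    with an L2 anchor to z0 standing in for semantic consistency."""
    A = W_unl - W_orig
    z = z0.copy()
    for _ in range(steps):
        # gradient of 0.5*||A z||^2 + 0.5*lam*||z - z0||^2
        grad = A.T @ (A @ z) + lam * (z - z0)
        z -= lr * grad
    return z

z0 = rng.standard_normal(4)
z_adv = optimize_latent(z0)
print(discrepancy(z0), discrepancy(z_adv))  # discrepancy shrinks after optimization
```

The anchor term `lam * (z - z0)` is a crude proxy for the semantic-consistency constraint the abstract mentions: without it, the minimizer would collapse the latent toward zero rather than keeping it a plausible initial noise sample.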