The Illusion of Forgetting: Attack Unlearned Diffusion via Initial Latent Variable Optimization
Manyi Li 1, Yufan Liu 2, Lai Jiang 3, Bing Li 2, Yuming Li 4, Weiming Hu 2
Published on arXiv
2602.00175
Input Manipulation Attack
OWASP ML Top 10 — ML01
Key Finding
IVO successfully reactivates NSFW 'dormant memories' in models treated with 8 different unlearning techniques, achieving superior attack success rates and strong semantic consistency, demonstrating that current unlearning-based defenses are fundamentally ineffective.
IVO (Initial Latent Variable Optimization)
Novel technique introduced
Although unlearning-based defenses claim to purge Not-Safe-For-Work (NSFW) concepts from diffusion models (DMs), we reveal that this "forgetting" is largely an illusion. Unlearning partially disrupts the mapping between linguistic symbols and the underlying knowledge, which remains intact as dormant memories. We find that the distributional discrepancy in the denoising process serves as a measurable indicator of how much of the mapping is retained, and thus of the strength of unlearning. Inspired by this, we propose IVO (Initial Latent Variable Optimization), a concise and powerful attack framework that reactivates these dormant memories by reconstructing the broken mappings. Through Image Inversion, Adversarial Optimization, and Reused Attack, IVO optimizes initial latent variables to realign the noise distribution of unlearned models with their original unsafe states. Extensive experiments across 8 widely used unlearning techniques demonstrate that IVO achieves superior attack success rates and strong semantic consistency, exposing fundamental flaws in current defenses. The code is available at anonymous.4open.science/r/IVO/. Warning: This paper contains unsafe images that may offend some readers.
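The realignment objective described above can be sketched in miniature: treat the original and unlearned denoisers as fixed maps, then run gradient descent on the initial latent variable so the unlearned model's noise prediction matches the original model's unsafe target. The linear "denoisers", dimensions, and step-size rule below are illustrative assumptions, not the paper's actual diffusion models or optimizer.

```python
import numpy as np

# Toy stand-ins (assumptions): the "original" and "unlearned" denoisers
# are linear maps whose noise predictions differ -- the gap between them
# plays the role of the disrupted linguistic-to-knowledge mapping.
rng = np.random.default_rng(0)
d = 8
W_orig = rng.normal(size=(d, d))                  # original (unsafe) denoiser
W_unl = W_orig + 0.5 * rng.normal(size=(d, d))    # unlearning perturbs the map

z_ref = rng.normal(size=d)        # reference latent (cf. "Image Inversion")
eps_target = W_orig @ z_ref       # unsafe noise prediction to realign toward

def loss(z):
    # Distributional discrepancy between the unlearned model's noise
    # prediction and the original unsafe target (squared L2 distance).
    r = W_unl @ z - eps_target
    return float(r @ r)

# Adversarial optimization: gradient descent on the initial latent z.
lip = 2.0 * np.linalg.norm(W_unl, ord=2) ** 2     # Lipschitz bound on the gradient
lr = 1.0 / lip                                    # step size guaranteeing descent
z = rng.normal(size=d)
l0 = loss(z)
for _ in range(500):
    grad = 2.0 * W_unl.T @ (W_unl @ z - eps_target)
    z = z - lr * grad
print(f"discrepancy: {l0:.3f} -> {loss(z):.6f}")
```

In this toy setting the discrepancy is driven close to zero: optimizing only the initial latent, with both models frozen, is enough to pull the "unlearned" model's output back to the unsafe target, which is the intuition behind reactivating dormant memories.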
Key Contributions
- Theoretical insight that unlearning in diffusion models does not erase unsafe knowledge but only disrupts linguistic-to-knowledge mappings, leaving 'dormant memories' intact
- IVO framework using Image Inversion, Adversarial Optimization, and Reused Attack to optimize initial latent variables and realign unlearned model noise distributions with unsafe states
- Empirical demonstration of superior attack success rates and semantic consistency across 8 widely used unlearning techniques, exposing fundamental flaws in current defenses
🛡️ Threat Analysis
IVO is a white-box adversarial attack that uses gradient-based optimization of initial latent variables (crafted inputs) at inference time to manipulate diffusion model outputs and bypass safety unlearning defenses. This directly fits the core definition of input manipulation via adversarial optimization.