Universal Adversarial Purification with DDIM Metric Loss for Stable Diffusion
Li Zheng 1,2, Liangbin Xie 1,2, Jiantao Zhou 1, He YiMin 1
Published on arXiv
2601.07253
Output Integrity Attack
OWASP ML Top 10 — ML09
Key Finding
UDAP effectively removes adversarial noise from training images protected by PID, Anti-DreamBooth, MIST, Anti-Diffusion, and MetaCloak, restoring high-quality SD fine-tuning outcomes across multiple SD versions and text prompts.
UDAP (Universal Diffusion Adversarial Purification)
Novel technique introduced
Stable Diffusion (SD) often produces degraded outputs when the training dataset contains adversarial noise. Adversarial purification offers a promising solution by removing adversarial noise from contaminated data. However, existing purification methods are primarily designed for classification tasks and fail to address SD-specific adversarial strategies, such as attacks targeting the VAE encoder, the UNet denoiser, or both. To address this gap in SD security, we propose Universal Diffusion Adversarial Purification (UDAP), a novel framework tailored to defend against adversarial attacks on SD models. UDAP leverages the distinct reconstruction behaviors of clean and adversarial images during Denoising Diffusion Implicit Models (DDIM) inversion to optimize the purification process. By minimizing the DDIM metric loss, UDAP can effectively remove adversarial noise. Additionally, we introduce a dynamic epoch adjustment strategy that adapts the number of optimization iterations to the reconstruction error, significantly improving efficiency without sacrificing purification quality. Experiments demonstrate UDAP's robustness against diverse adversarial methods, including PID (VAE-targeted), Anti-DreamBooth (UNet-targeted), MIST (hybrid), and robustness-enhanced variants such as Anti-Diffusion (Anti-DF) and MetaCloak. UDAP also generalizes well across SD versions and text prompts, showcasing its practical applicability in real-world scenarios.
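The core mechanism can be illustrated with a toy sketch (not the paper's implementation): a 1-D smoothing operator stands in for the DDIM inversion-and-reconstruction round trip, smooth "clean" signals survive the round trip nearly unchanged while adversarial noise inflates the reconstruction error, and purification is gradient descent on that error with respect to the image itself. All names, the operator, and the hyperparameters below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def round_trip(x):
    # Toy stand-in for DDIM inversion + reconstruction: a circular
    # moving average that leaves smooth content almost unchanged.
    return (np.roll(x, 1) + x + np.roll(x, -1)) / 3.0

def ddim_metric_loss(x):
    # Round-trip reconstruction error (analogue of the DDIM metric loss).
    r = round_trip(x) - x
    return float(np.sum(r * r))

def purify(x, steps=300, lr=0.25):
    # Minimize the loss w.r.t. the image. round_trip is linear and
    # symmetric here, so the gradient of sum((A x - x)^2) is
    # 2 (A - I)(A x - x), i.e. the round trip applied to the residual.
    x = x.copy()
    for _ in range(steps):
        r = round_trip(x) - x
        x -= lr * 2.0 * (round_trip(r) - r)
    return x

n = 64
clean = np.sin(2 * np.pi * np.arange(n) / n)  # smooth "clean image"
adv = clean + 0.2 * rng.standard_normal(n)    # adversarially perturbed copy
purified = purify(adv)
```

In this sketch the perturbed input shows a far larger round-trip error than the clean one, and descending on that error pulls the image back toward the clean signal; the actual method optimizes in SD's latent space using real DDIM inversion rather than a smoothing filter.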
Key Contributions
- First universal adversarial purification method specifically tailored for Stable Diffusion, defending against VAE-targeted, UNet-targeted, and hybrid adversarial attacks
- DDIM metric loss that exploits the observation that adversarial images yield significantly larger reconstruction errors under DDIM inversion than clean images, enabling noise removal by optimizing the initial latent
- Dynamic epoch adjustment strategy that adapts optimization iterations based on reconstruction error, improving efficiency while maintaining purification quality across mixed clean/adversarial datasets
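The dynamic epoch adjustment in the last contribution can be sketched as an early-stopping rule: optimize only while the reconstruction error stays above a threshold, so clean inputs (already low-error) are left essentially untouched while adversarial inputs receive just enough iterations. As before, a toy 1-D smoothing operator stands in for the DDIM round trip, and every name and threshold here is an illustrative assumption, not the paper's code.

```python
import numpy as np

rng = np.random.default_rng(1)

def round_trip(x):
    # Toy stand-in for DDIM inversion + reconstruction.
    return (np.roll(x, 1) + x + np.roll(x, -1)) / 3.0

def recon_error(x):
    r = round_trip(x) - x
    return float(np.sum(r * r))

def purify_dynamic(x, max_steps=500, lr=0.25, tol=2e-3):
    # Dynamic epoch adjustment (illustrative): iterate only while the
    # reconstruction error exceeds tol, so the number of optimization
    # epochs adapts to how contaminated the input actually is.
    x = x.copy()
    steps = 0
    while steps < max_steps and recon_error(x) > tol:
        r = round_trip(x) - x
        x -= lr * 2.0 * (round_trip(r) - r)
        steps += 1
    return x, steps

n = 64
clean = np.sin(2 * np.pi * np.arange(n) / n)
adv = clean + 0.2 * rng.standard_normal(n)

_, clean_steps = purify_dynamic(clean)     # already below tol: no epochs spent
purified, adv_steps = purify_dynamic(adv)  # iterates until the error drops
```

This is what makes the strategy efficient on mixed clean/adversarial datasets: clean images exit the loop immediately instead of paying the full optimization budget.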
🛡️ Threat Analysis
UDAP removes adversarial perturbations embedded in images by content protection schemes (Anti-DreamBooth, MIST, PID, MetaCloak), tools designed to prevent unauthorized AI fine-tuning. The guidelines explicitly categorize 'attacks that REMOVE or DEFEAT image protections via denoising, purification, or other techniques' as ML09 output integrity attacks, not ML01, even when the protections themselves use adversarial perturbations.