Robust Concept Erasure in Diffusion Models: A Theoretical Perspective on Security and Robustness
Zixuan Fu 1, Yan Ren 2, Finn Carter 3, Chenyue Wen 1, Le Ku 2, Daheng Yu 3, Emily Davis 1, Bo Zhang 2
Published on arXiv
2509.12024
Output Integrity Attack
OWASP ML Top 10 — ML09
Key Finding
SCORE achieves up to 12.5% higher erasure efficacy than prior methods while maintaining comparable image quality, backed by provable mutual information-based guarantees against residual concept leakage.
SCORE (Secure and Concept-Oriented Robust Erasure)
Novel technique introduced
Diffusion models have achieved unprecedented success in image generation but pose increasing risks in terms of privacy, fairness, and security. A growing demand exists to *erase* sensitive or harmful concepts (e.g., NSFW content, private individuals, artistic styles) from these models while preserving their overall generative capabilities. We introduce **SCORE** (Secure and Concept-Oriented Robust Erasure), a novel framework for robust concept removal in diffusion models. SCORE formulates concept erasure as an *adversarial independence* problem, theoretically guaranteeing that the model's outputs become statistically independent of the erased concept. Unlike prior heuristic methods, SCORE minimizes the mutual information between a target concept and generated outputs, yielding provable erasure guarantees. We provide formal proofs establishing convergence properties and derive upper bounds on residual concept leakage. Empirically, we evaluate SCORE on Stable Diffusion and FLUX across four challenging benchmarks: object erasure, NSFW removal, celebrity face suppression, and artistic style unlearning. SCORE consistently outperforms state-of-the-art methods including EraseAnything, ANT, MACE, ESD, and UCE, achieving up to **12.5%** higher erasure efficacy while maintaining comparable or superior image quality. By integrating adversarial optimization, trajectory consistency, and saliency-driven fine-tuning, SCORE sets a new standard for secure and robust concept erasure in diffusion models.
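The abstract's mutual-information formulation can be made concrete with the standard variational (discriminator-based) bound; the notation below is an illustrative sketch, not the paper's exact objective. Writing $C$ for the target concept and $X_\theta$ for the fine-tuned model's outputs, independence is enforced by driving $I(C; X_\theta)$ toward zero:

```latex
% Illustrative sketch, not the paper's exact formulation.
% For any auxiliary classifier q_\phi, the variational lower bound
%   I(C; X_\theta) \ge H(C) + \mathbb{E}\big[\log q_\phi(c \mid x)\big]
% motivates the adversarial-independence (minimax) objective:
\min_{\theta}\; \max_{\phi}\; H(C) \;+\; \mathbb{E}_{(c,\,x)\sim p_\theta}\big[\log q_\phi(c \mid x)\big]
```

At the saddle point the discriminator $q_\phi$ predicts the erased concept no better than chance, so the bound, and with it any certified residual leakage, collapses to zero.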
Key Contributions
- Formulates concept erasure as an adversarial independence problem, minimizing mutual information between target concepts and generated outputs to yield provable erasure guarantees with formal convergence proofs and residual leakage bounds.
- Integrates adversarial optimization (discriminator-based independence), trajectory consistency, and saliency-guided fine-tuning into a unified framework (SCORE) that is robust to adversarial prompts attempting to recover erased concepts.
- Empirically outperforms prior state-of-the-art erasure methods (EraseAnything, ANT, MACE, ESD, UCE), achieving up to 12.5% higher erasure efficacy across four benchmarks on Stable Diffusion and FLUX.
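The discriminator-based independence idea in the first two bullets can be sketched in miniature. The snippet below is a hypothetical illustration (none of these names or values come from the paper): a one-parameter logistic-regression discriminator estimates the variational lower bound I(C; X) ≥ H(C) + E[log q(c|x)] on two synthetic cases, one where model outputs still leak a binary concept and one where they are statistically independent of it.

```python
# Hypothetical illustration, not the paper's code: measuring residual
# concept leakage via the variational bound I(C;X) >= H(C) + E[log q(c|x)],
# using a tiny logistic-regression discriminator trained by gradient descent.
import numpy as np

rng = np.random.default_rng(0)

def mi_lower_bound(x, c, steps=2000, lr=0.5):
    """Fit q_phi(c=1|x) = sigmoid(w*x + b), return H(C) + E[log q(c|x)] (nats)."""
    w, b = 0.0, 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(w * x + b)))  # predicted P(c=1|x)
        grad_logit = p - c                       # gradient of BCE w.r.t. logit
        w -= lr * np.mean(grad_logit * x)
        b -= lr * np.mean(grad_logit)
    p = np.clip(1.0 / (1.0 + np.exp(-(w * x + b))), 1e-6, 1 - 1e-6)
    log_q = c * np.log(p) + (1 - c) * np.log(1 - p)  # E[log q(c|x)]
    h_c = np.log(2.0)                                # H(C) for balanced binary C
    return h_c + np.mean(log_q)

c = rng.integers(0, 2, size=4000).astype(float)     # binary target concept
x_leaky = c + 0.3 * rng.standard_normal(4000)       # outputs still carry the concept
x_erased = rng.standard_normal(4000)                # outputs independent of the concept

print(round(mi_lower_bound(x_leaky, c), 3))   # substantially positive: concept leaks
print(round(mi_lower_bound(x_erased, c), 3))  # near zero: no residual leakage
```

In SCORE's framing, fine-tuning the generator to *minimize* what such a discriminator can extract is exactly the adversarial-independence step; a bound near zero is the empirical signature of successful erasure.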
🛡️ Threat Analysis
SCORE's primary contribution is ensuring output integrity of diffusion models — guaranteeing that sensitive concepts (NSFW content, celebrity faces, copyrighted styles) cannot appear in model outputs even under adversarial prompting. The provable erasure guarantees via mutual information minimization are fundamentally about controlling what content the model can produce. Because the paper explicitly defends against adversarial users who attempt to recover erased content through indirect prompts, SCORE is best read as an output-integrity defense evaluated under adversarial conditions.