Robust Concept Erasure in Diffusion Models: A Theoretical Perspective on Security and Robustness
Zixuan Fu 1, Yan Ren 2, Finn Carter 3, Chenyue Wen 1, Le Ku 2, Daheng Yu 3, Emily Davis 1, Bo Zhang 2
Published on arXiv
2509.12024
Output Integrity Attack
OWASP ML Top 10 — ML09
Key Finding
SCORE achieves up to 12.5% higher erasure efficacy than prior methods while maintaining comparable image quality, backed by provable mutual information-based guarantees against residual concept leakage.
SCORE (Secure and Concept-Oriented Robust Erasure)
Novel technique introduced
Diffusion models have achieved unprecedented success in image generation but pose increasing risks in terms of privacy, fairness, and security. A growing demand exists to *erase* sensitive or harmful concepts (e.g., NSFW content, private individuals, artistic styles) from these models while preserving their overall generative capabilities. We introduce **SCORE** (Secure and Concept-Oriented Robust Erasure), a novel framework for robust concept removal in diffusion models. SCORE formulates concept erasure as an *adversarial independence* problem, theoretically guaranteeing that the model's outputs become statistically independent of the erased concept. Unlike prior heuristic methods, SCORE minimizes the mutual information between a target concept and generated outputs, yielding provable erasure guarantees. We provide formal proofs establishing convergence properties and derive upper bounds on residual concept leakage. Empirically, we evaluate SCORE on Stable Diffusion and FLUX across four challenging benchmarks: object erasure, NSFW removal, celebrity face suppression, and artistic style unlearning. SCORE consistently outperforms state-of-the-art methods including EraseAnything, ANT, MACE, ESD, and UCE, achieving up to **12.5%** higher erasure efficacy while maintaining comparable or superior image quality. By integrating adversarial optimization, trajectory consistency, and saliency-driven fine-tuning, SCORE sets a new standard for secure and robust concept erasure in diffusion models.
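The abstract's mutual-information formulation can be made concrete with the standard variational (discriminator-based) bound; the notation below is an illustrative sketch, not the paper's exact objective. Writing $C$ for the target concept and $X_\theta$ for the fine-tuned model's outputs, independence is enforced by driving $I(C; X_\theta)$ toward zero:

```latex
% Illustrative sketch, not the paper's exact formulation.
% For any auxiliary classifier q_\phi, the variational lower bound
%   I(C; X_\theta) \ge H(C) + \mathbb{E}\big[\log q_\phi(c \mid x)\big]
% motivates the adversarial-independence (minimax) objective:
\min_{\theta}\; \max_{\phi}\; H(C) \;+\; \mathbb{E}_{(c,\,x)\sim p_\theta}\big[\log q_\phi(c \mid x)\big]
```

At the saddle point the discriminator $q_\phi$ predicts the erased concept no better than chance, so the bound, and with it any certified residual leakage, collapses to zero.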
Key Contributions
- Formulates concept erasure as an adversarial independence problem, minimizing mutual information between target concepts and generated outputs to yield provable erasure guarantees with formal convergence proofs and residual leakage bounds.
- Integrates adversarial optimization (discriminator-based independence), trajectory consistency, and saliency-guided fine-tuning into a unified framework (SCORE) that is robust to adversarial prompts attempting to recover erased concepts.
- Empirically outperforms prior state-of-the-art erasure methods (EraseAnything, ANT, MACE, ESD, UCE), achieving up to 12.5% higher erasure efficacy across four benchmarks on Stable Diffusion and FLUX.
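The discriminator-based independence idea in the first two bullets can be sketched in miniature. The snippet below is a hypothetical illustration (none of these names or values come from the paper): a one-parameter logistic-regression discriminator estimates the variational lower bound I(C; X) ≥ H(C) + E[log q(c|x)] on two synthetic cases, one where model outputs still leak a binary concept and one where they are statistically independent of it.

```python
# Hypothetical illustration, not the paper's code: measuring residual
# concept leakage via the variational bound I(C;X) >= H(C) + E[log q(c|x)],
# using a tiny logistic-regression discriminator trained by gradient descent.
import numpy as np

rng = np.random.default_rng(0)

def mi_lower_bound(x, c, steps=2000, lr=0.5):
    """Fit q_phi(c=1|x) = sigmoid(w*x + b), return H(C) + E[log q(c|x)] (nats)."""
    w, b = 0.0, 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(w * x + b)))  # predicted P(c=1|x)
        grad_logit = p - c                       # gradient of BCE w.r.t. logit
        w -= lr * np.mean(grad_logit * x)
        b -= lr * np.mean(grad_logit)
    p = np.clip(1.0 / (1.0 + np.exp(-(w * x + b))), 1e-6, 1 - 1e-6)
    log_q = c * np.log(p) + (1 - c) * np.log(1 - p)  # E[log q(c|x)]
    h_c = np.log(2.0)                                # H(C) for balanced binary C
    return h_c + np.mean(log_q)

c = rng.integers(0, 2, size=4000).astype(float)     # binary target concept
x_leaky = c + 0.3 * rng.standard_normal(4000)       # outputs still carry the concept
x_erased = rng.standard_normal(4000)                # outputs independent of the concept

print(round(mi_lower_bound(x_leaky, c), 3))   # substantially positive: concept leaks
print(round(mi_lower_bound(x_erased, c), 3))  # near zero: no residual leakage
```

In SCORE's framing, fine-tuning the generator to *minimize* what such a discriminator can extract is exactly the adversarial-independence step; a bound near zero is the empirical signature of successful erasure.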
🛡️ Threat Analysis
SCORE's primary contribution is ensuring output integrity of diffusion models — guaranteeing that sensitive concepts (NSFW content, celebrity faces, copyrighted styles) cannot appear in model outputs even under adversarial prompting. The provable erasure guarantees via mutual information minimization are fundamentally about controlling what content the model can produce. Because the paper explicitly defends against adversarial users who attempt to recover erased content through indirect prompts, SCORE is best read as an output-integrity defense evaluated under adversarial conditions.