Rethinking Robust Adversarial Concept Erasure in Diffusion Models
Qinghong Yin 1, Yu Tian 2, Heming Yang 3, Xiang Chen 4, Xianlin Zhang 1, Xueming Li 1, Yue Zhan 1
Published on arXiv
2510.27285
Input Manipulation Attack
OWASP ML Top 10 — ML01
Key Finding
S-GRACE improves concept erasure performance by at least 26%, better preserves non-target concepts, and reduces training time by 90% compared to existing adversarial concept erasure methods.
S-GRACE (Semantics-Guided Robust Adversarial Concept Erasure)
Novel technique introduced
Concept erasure aims to selectively unlearn undesirable content in diffusion models (DMs) to reduce the risk of sensitive content generation. Most existing methods follow a recent paradigm in concept erasure: they employ adversarial training to identify and suppress target concepts, thus reducing the likelihood of sensitive outputs. However, these methods often neglect the specificity of adversarial training in DMs, resulting in only partial mitigation. In this work, we investigate and quantify this specificity from the perspective of concept space, i.e., can adversarial samples truly fit the target concept space? We observe that existing methods neglect the role of conceptual semantics when generating adversarial samples, resulting in ineffective fitting of concept spaces. This oversight leads to the following issues: 1) when there are few adversarial samples, they fail to comprehensively cover the target concept; 2) conversely, when there are many, they disrupt other concept spaces. Motivated by these findings, we introduce S-GRACE (Semantics-Guided Robust Adversarial Concept Erasure), which leverages semantic guidance within the concept space to generate adversarial samples and perform erasure training. Experiments conducted with seven state-of-the-art methods and three adversarial prompt generation strategies across various DM unlearning scenarios demonstrate that S-GRACE significantly improves erasure performance by 26%, better preserves non-target concepts, and reduces training time by 90%. Our code is available at https://github.com/Qhong-522/S-GRACE.
Key Contributions
- Analysis showing existing adversarial concept erasure methods fail because they neglect conceptual semantics, leading to incomplete concept space coverage or disruption of non-target concepts
- S-GRACE: a semantics-guided adversarial sample generation method that better fits the target concept space during erasure training
- Demonstrated ≥26% improvement in erasure performance and 90% reduction in training time across seven baselines and three adversarial prompt strategies
🛡️ Threat Analysis
The threat model covers adversarial prompts supplied at inference time to bypass concept erasure safety mechanisms in diffusion models (evasion attacks against a trained safety system). S-GRACE defends against these attacks by using semantic guidance during adversarial training so that the adversarial samples comprehensively cover the target concept space, which makes the erasure robust to such inputs. The paper's primary contribution is improved robustness against adversarial prompt attacks on DM safety mechanisms.
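The idea of fitting adversarial samples to the target concept space can be illustrated with a toy sketch. This is not the S-GRACE algorithm, whose details are not given in this summary. It assumes concepts are plain embedding vectors, that adversarial candidates are random perturbations of the target embedding, and that "semantic guidance" is approximated by two cosine-similarity thresholds. The names `cosine`, `perturb`, `semantic_filter`, `min_target_sim`, and `max_other_sim` are all hypothetical.

```python
import math
import random


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def perturb(vec, scale, rng):
    """Toy stand-in for an adversarial candidate: add Gaussian noise."""
    return [x + rng.gauss(0.0, scale) for x in vec]


def semantic_filter(target, others, candidates,
                    min_target_sim=0.8, max_other_sim=0.9):
    """Keep candidates that stay close to the target concept and do not
    drift too close to any non-target concept (thresholds are assumptions)."""
    kept = []
    for cand in candidates:
        near_target = cosine(cand, target) >= min_target_sim
        far_from_others = all(cosine(cand, o) <= max_other_sim for o in others)
        if near_target and far_from_others:
            kept.append(cand)
    return kept


rng = random.Random(0)                 # fixed seed for repeatability
target = [1.0, 0.0, 0.0]               # embedding of the concept to erase (toy)
others = [[0.8, 0.6, 0.0]]             # a semantically nearby concept to preserve (toy)
candidates = [perturb(target, 0.6, rng) for _ in range(200)]
kept = semantic_filter(target, others, candidates)
print(f"kept {len(kept)} of {len(candidates)} candidates")
```

In this toy setting, loosening `min_target_sim` admits more candidates, which widens coverage of the target concept, while tightening `max_other_sim` protects neighbouring concepts. This mirrors the coverage versus disruption trade-off described in the abstract, under the stated assumptions only.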