Published on arXiv

2510.27285

Input Manipulation Attack

OWASP ML Top 10 — ML01

Key Finding

S-GRACE improves concept erasure performance by at least 26%, better preserves non-target concepts, and reduces training time by 90% compared to existing adversarial concept erasure methods.

S-GRACE (Semantics-Guided Robust Adversarial Concept Erasure)

Novel technique introduced


Concept erasure aims to selectively unlearn undesirable content in diffusion models (DMs) to reduce the risk of sensitive content generation. Following a recent paradigm in concept erasure, most existing methods employ adversarial training to identify and suppress target concepts, thus reducing the likelihood of sensitive outputs. However, these methods often neglect the specificity of adversarial training in DMs, resulting in only partial mitigation. In this work, we investigate and quantify this specificity from the perspective of concept space, i.e., can adversarial samples truly fit the target concept space? We observe that existing methods neglect the role of conceptual semantics when generating adversarial samples, resulting in ineffective fitting of concept spaces. This oversight leads to the following issues: 1) when there are few adversarial samples, they fail to comprehensively cover the target concept; 2) conversely, they disrupt non-target concept spaces. Motivated by these findings, we introduce S-GRACE (Semantics-Guided Robust Adversarial Concept Erasure), which leverages semantic guidance within the concept space to generate adversarial samples and perform erasure training. Experiments conducted with seven state-of-the-art methods and three adversarial prompt generation strategies across various DM unlearning scenarios demonstrate that S-GRACE improves erasure performance by at least 26%, better preserves non-target concepts, and reduces training time by 90%. Our code is available at https://github.com/Qhong-522/S-GRACE.


Key Contributions

  • Analysis showing existing adversarial concept erasure methods fail because they neglect conceptual semantics, leading to incomplete concept space coverage or disruption of non-target concepts
  • S-GRACE: a semantics-guided adversarial sample generation method that better fits the target concept space during erasure training
  • Demonstrated ≥26% improvement in erasure performance and 90% reduction in training time across seven baselines and three adversarial prompt strategies
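
The idea of semantic guidance named in the second contribution can be pictured with a toy sketch. The Python below is not the authors' implementation (the repository linked in the abstract is the reference for that). It only illustrates, under simplifying assumptions, how candidate adversarial samples might be screened so that they stay inside a target concept region without drifting into a non-target one. The 2-D embeddings, the function names `cosine` and `filter_by_semantics`, and the `min_sim` threshold are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def filter_by_semantics(candidates, target, non_targets, min_sim=0.8):
    """Keep candidates that are close to the target concept and
    closer to it than to every non-target concept.

    Hypothetical helper for illustration; not part of S-GRACE's code.
    """
    kept = []
    for cand in candidates:
        sim_target = cosine(cand, target)
        # Highest similarity to any concept that should be preserved.
        sim_other = max((cosine(cand, n) for n in non_targets), default=-1.0)
        if sim_target >= min_sim and sim_target > sim_other:
            kept.append(cand)
    return kept


# Toy 2-D "concept embeddings" (assumed values, for illustration only).
target = [1.0, 0.0]          # concept to erase
non_targets = [[0.0, 1.0]]   # concept to preserve

candidates = [
    [0.9, 0.1],    # near the target concept
    [0.7, 0.7],    # halfway between target and non-target
    [0.1, 0.9],    # inside the non-target concept
    [1.0, -0.2],   # near the target concept, away from the non-target
]

kept = filter_by_semantics(candidates, target, non_targets)
print(kept)
```

In this toy setting, samples that fall below the similarity threshold or sit closer to the preserved concept are discarded, which mirrors the two failure modes described in the first contribution: poor coverage of the target concept and disruption of non-target concepts.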

🛡️ Threat Analysis

Input Manipulation Attack

The threat model is adversarial prompts at inference time that bypass concept erasure safety mechanisms in diffusion models (evasion attacks against a trained safety system). S-GRACE defends against this by using semantic guidance during adversarial training to comprehensively cover the target concept space, making the erasure robust to these adversarial inputs. The paper's primary contribution is improving robustness against adversarial prompt attacks on DM safety mechanisms.


Details

Domains
generative, vision
Model Types
diffusion
Threat Tags
inference_time, training_time, digital
Applications
text-to-image generation, content safety, nsfw content suppression