defense 2025

Rethinking Robust Adversarial Concept Erasure in Diffusion Models

0 citations · 40 references · arXiv

Published on arXiv

2510.27285

Input Manipulation Attack

OWASP ML Top 10 — ML01

Key Finding

S-GRACE improves concept erasure performance by at least 26%, better preserves non-target concepts, and reduces training time by 90% compared to existing adversarial concept erasure methods.

S-GRACE (Semantics-Guided Robust Adversarial Concept Erasure)

Novel technique introduced

Concept erasure aims to selectively unlearn undesirable content in diffusion models (DMs) to reduce the risk of sensitive content generation. Following a recent paradigm in concept erasure, most existing methods employ adversarial training to identify and suppress target concepts, thus reducing the likelihood of sensitive outputs. However, these methods often neglect the specificity of adversarial training in DMs, resulting in only partial mitigation. In this work, we investigate and quantify this specificity from the perspective of concept space, i.e., can adversarial samples truly fit the target concept space? We observe that existing methods neglect the role of conceptual semantics when generating adversarial samples, resulting in ineffective fitting of concept spaces. This oversight leads to the following issues: 1) when there are few adversarial samples, they fail to comprehensively cover the object concept; 2) conversely, when there are many, they disrupt other target concept spaces. Motivated by the analysis of these findings, we introduce S-GRACE (Semantics-Guided Robust Adversarial Concept Erasure), which leverages semantic guidance within the concept space to generate adversarial samples and perform erasure training. Experiments conducted with seven state-of-the-art methods and three adversarial prompt generation strategies across various DM unlearning scenarios demonstrate that S-GRACE significantly improves erasure performance by 26%, better preserves non-target concepts, and reduces training time by 90%. Our code is available at https://github.com/Qhong-522/S-GRACE.

Key Contributions

Analysis showing existing adversarial concept erasure methods fail because they neglect conceptual semantics, leading to incomplete concept space coverage or disruption of non-target concepts
S-GRACE: a semantics-guided adversarial sample generation method that better fits the target concept space during erasure training
Demonstrated ≥26% improvement in erasure performance and 90% reduction in training time across seven baselines and three adversarial prompt strategies

🛡️ Threat Analysis

Input Manipulation Attack

The threat model is adversarial prompts at inference time that bypass concept erasure safety mechanisms in diffusion models (evasion attacks against a trained safety system). S-GRACE defends against this by using semantic guidance during adversarial training to comprehensively cover the target concept space, making the erasure robust to these adversarial inputs. The paper's primary contribution is improving robustness against adversarial prompt attacks on DM safety mechanisms.
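The idea of keeping adversarial samples inside the target concept space can be illustrated with a toy sketch. This is not the S-GRACE algorithm, which the summary above does not specify. It assumes, purely for illustration, that concepts are small embedding vectors, that a hypothetical linear score measures how strongly the target concept is re-triggered, and that "semantic guidance" is approximated by a cosine-similarity floor relative to the target embedding. The names `adversarial_embedding`, `leak_direction`, and `min_cos` are all hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def adversarial_embedding(target, leak_direction, steps=20, lr=0.5, min_cos=None):
    """Gradient ascent on a toy linear score dot(leak_direction, e), starting at target.

    If min_cos is given, the search stops before taking a step that would push
    the sample's cosine similarity to the target concept below that floor
    (a stand-in for semantic guidance; an assumption, not the paper's method).
    """
    e = list(target)
    for _ in range(steps):
        candidate = [x + lr * g for x, g in zip(e, leak_direction)]
        if min_cos is not None and cosine(candidate, target) < min_cos:
            break  # the next step would leave the target concept's neighbourhood
        e = candidate
    return e

# Hypothetical 3-d concept space: axis 0 is the target concept.
target = [1.0, 0.0, 0.0]
# Hypothetical direction along which the toy "erased" model still leaks the concept.
leak_direction = [0.2, 1.0, 0.0]

free = adversarial_embedding(target, leak_direction)                 # no semantic constraint
guided = adversarial_embedding(target, leak_direction, min_cos=0.8)  # constrained search

print("unconstrained cosine to target:", round(cosine(free, target), 3))
print("guided cosine to target:", round(cosine(guided, target), 3))
```

In this toy setting the unconstrained sample drifts far from the target concept, which loosely mirrors the disruption of other concepts described in the analysis, while the constrained sample still raises the leak score but stays semantically close to the target.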

Details

Domains

generative, vision

Model Types

diffusion

Threat Tags

inference_time, training_time, digital

Applications

text-to-image generation, content safety, nsfw content suppression
