Side Effects of Erasing Concepts from Diffusion Models
Shaswati Saha, Sourajit Saha, Manas Gaur, Tejas Gokhale
Published on arXiv (2508.15124)
Output Integrity Attack
OWASP ML Top 10 — ML09
Key Finding
All evaluated concept erasure techniques can be circumvented using prompt-level strategies such as superclass-subclass hierarchies and compositional variants, and exhibit attribute leakage to semantically neighboring concepts.
SEE (Side Effect Evaluation)
Novel technique introduced
Concerns about text-to-image (T2I) generative models infringing on privacy, copyright, and safety have led to the development of concept erasure techniques (CETs). The goal of an effective CET is to prohibit the generation of undesired "target" concepts specified by the user, while preserving the ability to synthesize high-quality images of other concepts. In this work, we demonstrate that concept erasure has side effects and that CETs can be easily circumvented. For a comprehensive measurement of the robustness of CETs, we present the Side Effect Evaluation (SEE) benchmark, which consists of hierarchical and compositional prompts describing objects and their attributes. The dataset and an automated evaluation pipeline quantify side effects of CETs across three aspects: impact on neighboring concepts, evasion of targets, and attribute leakage. Our experiments reveal that CETs can be circumvented by using superclass-subclass hierarchies, semantically similar prompts, and compositional variants of the target. We show that CETs suffer from attribute leakage and a counterintuitive phenomenon of attention concentration or dispersal. We release our benchmark and evaluation tools to aid future work on robust concept erasure.
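The hierarchical and compositional probing described above can be sketched in a few lines. This is an illustrative reconstruction, not the paper's actual benchmark code: the `HIERARCHY` and `ATTRIBUTES` tables, the `see_style_prompts` helper, and the prompt templates are all hypothetical stand-ins showing how a single erased target concept fans out into superclass, subclass, and attribute-composed probe prompts.

```python
# Illustrative sketch of SEE-style prompt expansion (not the paper's code).
# For an erased target, emit direct, superclass, subclass, and compositional
# probe prompts that test whether the erasure can be circumvented.

# Hypothetical concept taxonomy and attribute list for demonstration only.
HIERARCHY = {
    "dog": {"superclass": "animal", "subclasses": ["beagle", "golden retriever"]},
}
ATTRIBUTES = ["blue", "fluffy"]


def _article(word: str) -> str:
    """Naive a/an selection for readable prompts."""
    return "an" if word[0].lower() in "aeiou" else "a"


def see_style_prompts(target: str) -> list[str]:
    """Expand an erased target into hierarchical and compositional probes."""
    entry = HIERARCHY[target]
    sup = entry["superclass"]
    prompts = [f"a photo of {_article(target)} {target}"]          # direct target
    prompts.append(f"a photo of {_article(sup)} {sup}")            # superclass probe
    prompts += [f"a photo of {_article(s)} {s}"                    # subclass probes
                for s in entry["subclasses"]]
    prompts += [f"a photo of {_article(a)} {a} {target}"           # compositional
                for a in ATTRIBUTES]
    return prompts
```

Feeding each probe to an erased model and detecting what it generates is what separates successful erasure from the evasion and leakage failures the paper reports.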
Key Contributions
- SEE benchmark with hierarchical and compositional prompts to systematically measure side effects of concept erasure techniques across three dimensions: neighboring concept impact, target evasion, and attribute leakage
- Demonstration that existing CETs are circumventable via superclass-subclass hierarchies, semantically similar prompts, and compositional prompt variants
- Discovery of attention concentration/dispersal phenomenon and attribute leakage as counterintuitive failure modes of concept erasure
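The three measurement dimensions can be made concrete with a small scoring sketch. Everything here is assumed for illustration: the function name, the stubbed detector outputs, and the label strings paraphrase the paper's taxonomy rather than reproduce its pipeline, which in practice would run a concept/attribute detector over generated images.

```python
# Hypothetical bucketing of one generation into SEE's three side-effect
# dimensions. `detected` and `leaked_attrs` would come from an automated
# detector in a real pipeline; here they are plain inputs.

def score_outcome(target: str, prompt_concept: str,
                  detected: set[str], leaked_attrs: list[str]) -> list[str]:
    """Return which side effects one (prompt, output) pair exhibits."""
    effects = []
    if prompt_concept != target and prompt_concept not in detected:
        effects.append("neighbor_impact")    # erasure degraded a related concept
    if target in detected:
        effects.append("target_evasion")     # erased concept was still generated
    if leaked_attrs:
        effects.append("attribute_leakage")  # target attributes bled into output
    return effects
```

Note that one generation can exhibit several effects at once, e.g. a "beagle" prompt that yields a generic dog both evades the target erasure and shows damage to the neighboring subclass concept.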
🛡️ Threat Analysis
Concept erasure techniques are output-safety mechanisms that govern what content T2I models are allowed to generate. The paper demonstrates failures in these output integrity guarantees (circumvention via superclass-subclass hierarchies, semantically similar prompts, compositional variants) and provides a benchmark to measure how reliably these safety constraints hold — a core output integrity concern for generative models.