defense 2026

Neighbor-Aware Localized Concept Erasure in Text-to-Image Diffusion Models

Zhuan Shi 1,2, Alireza Dehghanpour Farashah 3, Rik de Vries 1,2, Golnoosh Farnadi 1,2



Published on arXiv

arXiv:2603.25994

Output Integrity Attack

OWASP ML Top 10 — ML09

Key Finding

Effectively removes target concepts while better preserving closely related categories in fine-grained domains compared to prior localized erasure methods

NLCE (Neighbor-Aware Localized Concept Erasure)

Novel technique introduced


Concept erasure in text-to-image diffusion models seeks to remove undesired concepts while preserving overall generative capability. Localized erasure methods aim to restrict edits to the spatial region occupied by the target concept. However, we observe that suppressing a concept can unintentionally weaken semantically related neighbor concepts, reducing fidelity in fine-grained domains. We propose Neighbor-Aware Localized Concept Erasure (NLCE), a training-free framework designed to better preserve neighboring concepts while removing target concepts. It operates in three stages: (1) a spectrally-weighted embedding modulation that attenuates target concept directions while stabilizing neighbor concept representations, (2) an attention-guided spatial gate that identifies regions exhibiting residual concept activation, and (3) a spatially-gated hard erasure that eliminates remaining traces only where necessary. This neighbor-aware pipeline enables localized concept removal while maintaining the surrounding concept neighborhood structure. Experiments on fine-grained datasets (Oxford Flowers, Stanford Dogs) show that our method effectively removes target concepts while better preserving closely related categories. Additional results on celebrity identity, explicit content and artistic style demonstrate robustness and generalization to broader erasure scenarios.
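Stage (1) can be illustrated with a minimal numpy sketch. This is an assumption-laden reconstruction, not the paper's exact formulation: the function name, the SVD-derived neighbor subspace, and the erasure strength `alpha` are all illustrative. The idea shown is that the target direction is first made orthogonal to the subspace spanned by neighbor-concept embeddings, so attenuating it cannot disturb neighbor components.

```python
import numpy as np

def neighbor_aware_erase(emb, target, neighbors, alpha=1.0):
    """Attenuate the target concept direction in a prompt embedding while
    leaving neighbor-concept components untouched (illustrative sketch).

    emb:       (d,) prompt embedding
    target:    (d,) target concept direction
    neighbors: list of (d,) neighbor concept directions
    alpha:     erasure strength (1.0 = full removal along the safe direction)
    """
    # Orthonormal basis for the neighbor subspace via SVD; the "spectral"
    # weighting here is simply the right singular vectors (an assumption).
    N = np.stack(neighbors)                       # (k, d)
    Vh = np.linalg.svd(N, full_matrices=False)[2] # (k, d), rows orthonormal
    t = target / np.linalg.norm(target)
    # Remove the part of the target direction lying in the neighbor subspace,
    # so erasing along it cannot weaken neighbor concepts. (If the target lies
    # entirely inside the neighbor span, t_safe degenerates to zero.)
    t_safe = t - Vh.T @ (Vh @ t)
    t_safe = t_safe / np.linalg.norm(t_safe)
    # Subtract the embedding's component along the neighbor-safe direction.
    return emb - alpha * (emb @ t_safe) * t_safe
```

By construction the output has no component along the neighbor-safe target direction, while its projections onto every neighbor direction are unchanged, which is the property the neighbor-aware design is after.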


Key Contributions

  • Neighbor-aware concept erasure framework that preserves semantically related concepts while removing targets
  • Three-stage pipeline: spectrally-weighted embedding modulation, attention-guided spatial gating, and spatially-gated hard erasure
  • Training-free approach demonstrating effectiveness on fine-grained datasets, celebrity identities, explicit content, and artistic styles
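Stages (2) and (3) of the pipeline above can be sketched together: a cross-attention map for the target token is thresholded into a binary spatial gate, and latent features are hard-erased only inside the gated region. The normalization, the threshold `tau`, and zeroing as the "hard erasure" are assumptions for illustration; the paper's operators may differ.

```python
import numpy as np

def spatial_gate_erase(latent, attn_map, tau=0.5):
    """Illustrative stages 2-3: gate regions with residual concept activation,
    then hard-erase latent features only where the gate fires.

    latent:   (C, H, W) latent feature map
    attn_map: (H, W) cross-attention scores for the target concept token
    tau:      gating threshold in [0, 1] (assumed value)
    """
    # Normalize attention to [0, 1] so a single tau is comparable across images.
    a = (attn_map - attn_map.min()) / (attn_map.max() - attn_map.min() + 1e-8)
    gate = a > tau                 # True where residual activation remains
    out = latent.copy()
    out[..., gate] = 0.0           # hard erasure, applied only where necessary
    return out, gate
```

Restricting the hard erasure to the gated region is what keeps the edit localized: features outside the gate, including those supporting neighbor concepts, pass through unmodified.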

🛡️ Threat Analysis

Output Integrity Attack

Focuses on controlling and sanitizing the outputs of generative models by erasing specific concepts (explicit content, celebrity identities, artistic styles) from text-to-image diffusion models. This is fundamentally about output integrity and content control — ensuring models don't generate certain types of content. The paper addresses the problem of removing undesired concepts while maintaining generation quality for related concepts.


Details

Domains
vision, generative
Model Types
diffusion
Threat Tags
inference_time
Datasets
Oxford Flowers, Stanford Dogs
Applications
text-to-image generation, content moderation, concept removal