Neighbor-Aware Localized Concept Erasure in Text-to-Image Diffusion Models
Zhuan Shi 1,2, Alireza Dehghanpour Farashah 3, Rik de Vries 1,2, Golnoosh Farnadi 1,2
Published on arXiv
2603.25994
Output Integrity Attack
OWASP ML Top 10 — ML09
Key Finding
Removes target concepts effectively while preserving closely related categories in fine-grained domains better than prior localized erasure methods
NLCE (Neighbor-Aware Localized Concept Erasure)
Novel technique introduced
Concept erasure in text-to-image diffusion models seeks to remove undesired concepts while preserving overall generative capability. Localized erasure methods aim to restrict edits to the spatial region occupied by the target concept. However, we observe that suppressing a concept can unintentionally weaken semantically related neighbor concepts, reducing fidelity in fine-grained domains. We propose Neighbor-Aware Localized Concept Erasure (NLCE), a training-free framework designed to better preserve neighboring concepts while removing target concepts. It operates in three stages: (1) a spectrally-weighted embedding modulation that attenuates target concept directions while stabilizing neighbor concept representations, (2) an attention-guided spatial gate that identifies regions exhibiting residual concept activation, and (3) a spatially-gated hard erasure that eliminates remaining traces only where necessary. This neighbor-aware pipeline enables localized concept removal while maintaining the surrounding concept neighborhood structure. Experiments on fine-grained datasets (Oxford Flowers, Stanford Dogs) show that our method effectively removes target concepts while better preserving closely related categories. Additional results on celebrity identities, explicit content, and artistic styles demonstrate robustness and generalization to broader erasure scenarios.
Key Contributions
- Neighbor-aware concept erasure framework that preserves semantically related concepts while removing targets
- Three-stage pipeline: spectrally-weighted embedding modulation, attention-guided spatial gating, and spatially-gated hard erasure
- Training-free approach demonstrating effectiveness on fine-grained datasets, celebrity identities, explicit content, and artistic styles
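The three-stage pipeline above can be illustrated with a minimal NumPy sketch. This is a simplified illustration under stated assumptions, not the paper's implementation: the function names, the single-vector embedding, the fixed attenuation factor `alpha`, the threshold `tau`, and the neutral-replacement strategy in stage 3 are all hypothetical choices made for clarity.

```python
# Hypothetical sketch of an NLCE-style three-stage pipeline.
# All names and parameters are illustrative, not from the paper's code.
import numpy as np

def modulate_embedding(emb, target_dir, neighbor_dirs, alpha=0.9):
    """Stage 1: attenuate the embedding's component along the target
    concept direction, then restore its components along neighbor
    directions so related concepts stay stable."""
    t = target_dir / np.linalg.norm(target_dir)
    proj = (emb @ t) * t                     # component along target
    out = emb - alpha * proj                 # soft attenuation
    for n in neighbor_dirs:                  # re-stabilize neighbors
        n = n / np.linalg.norm(n)
        out += (emb @ n) * n - (out @ n) * n
    return out

def spatial_gate(attn_map, tau=0.5):
    """Stage 2: binary mask over spatial positions whose (normalized)
    cross-attention to the target concept still exceeds a threshold."""
    lo, hi = attn_map.min(), attn_map.max()
    a = (attn_map - lo) / (hi - lo + 1e-8)
    return a > tau

def hard_erase(latents, mask, neutral):
    """Stage 3: hard-replace residual activation with a neutral value,
    but only inside the gated region."""
    out = latents.copy()
    out[mask] = neutral
    return out
```

Chaining the three calls mirrors the paper's ordering: soft modulation first, then spatial gating, then hard erasure only where residual activation survives.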
🛡️ Threat Analysis
Focuses on controlling and sanitizing the outputs of generative models by erasing specific concepts (explicit content, celebrity identities, artistic styles) from text-to-image diffusion models. This is fundamentally a matter of output integrity and content control: ensuring models do not generate certain types of content while maintaining generation quality for related concepts.