defense 2026

Neighbor-Aware Localized Concept Erasure in Text-to-Image Diffusion Models

Zhuan Shi 1,2, Alireza Dehghanpour Farashah 3, Rik de Vries 1,2, Golnoosh Farnadi 1,2



Published on arXiv

arXiv:2603.25994

Output Integrity Attack

OWASP ML Top 10 — ML09

Key Finding

Effectively removes target concepts while better preserving closely related categories in fine-grained domains compared to prior localized erasure methods

NLCE (Neighbor-Aware Localized Concept Erasure)

Novel technique introduced


Concept erasure in text-to-image diffusion models seeks to remove undesired concepts while preserving overall generative capability. Localized erasure methods aim to restrict edits to the spatial region occupied by the target concept. However, we observe that suppressing a concept can unintentionally weaken semantically related neighbor concepts, reducing fidelity in fine-grained domains. We propose Neighbor-Aware Localized Concept Erasure (NLCE), a training-free framework designed to better preserve neighboring concepts while removing target concepts. It operates in three stages: (1) a spectrally-weighted embedding modulation that attenuates target concept directions while stabilizing neighbor concept representations, (2) an attention-guided spatial gate that identifies regions exhibiting residual concept activation, and (3) a spatially-gated hard erasure that eliminates remaining traces only where necessary. This neighbor-aware pipeline enables localized concept removal while maintaining the surrounding concept neighborhood structure. Experiments on fine-grained datasets (Oxford Flowers, Stanford Dogs) show that our method effectively removes target concepts while better preserving closely related categories. Additional results on celebrity identity, explicit content and artistic style demonstrate robustness and generalization to broader erasure scenarios.
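Stage (1) can be illustrated with a minimal numpy sketch. This is an assumption-laden reconstruction, not the paper's exact formulation: the function name, the SVD-derived neighbor subspace, and the erasure strength `alpha` are all illustrative. The idea shown is that the target direction is first made orthogonal to the subspace spanned by neighbor-concept embeddings, so attenuating it cannot disturb neighbor components.

```python
import numpy as np

def neighbor_aware_erase(emb, target, neighbors, alpha=1.0):
    """Attenuate the target concept direction in a prompt embedding while
    leaving neighbor-concept components untouched (illustrative sketch).

    emb:       (d,) prompt embedding
    target:    (d,) target concept direction
    neighbors: list of (d,) neighbor concept directions
    alpha:     erasure strength (1.0 = full removal along the safe direction)
    """
    # Orthonormal basis for the neighbor subspace via SVD; the "spectral"
    # weighting here is simply the right singular vectors (an assumption).
    N = np.stack(neighbors)                       # (k, d)
    Vh = np.linalg.svd(N, full_matrices=False)[2] # (k, d), rows orthonormal
    t = target / np.linalg.norm(target)
    # Remove the part of the target direction lying in the neighbor subspace,
    # so erasing along it cannot weaken neighbor concepts. (If the target lies
    # entirely inside the neighbor span, t_safe degenerates to zero.)
    t_safe = t - Vh.T @ (Vh @ t)
    t_safe = t_safe / np.linalg.norm(t_safe)
    # Subtract the embedding's component along the neighbor-safe direction.
    return emb - alpha * (emb @ t_safe) * t_safe
```

By construction the output has no component along the neighbor-safe target direction, while its projections onto every neighbor direction are unchanged, which is the property the neighbor-aware design is after.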


Key Contributions

  • Neighbor-aware concept erasure framework that preserves semantically related concepts while removing targets
  • Three-stage pipeline: spectrally-weighted embedding modulation, attention-guided spatial gating, and spatially-gated hard erasure
  • Training-free approach demonstrating effectiveness on fine-grained datasets, celebrity identities, explicit content, and artistic styles
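Stages (2) and (3) of the pipeline above can be sketched together: a cross-attention map for the target token is thresholded into a binary spatial gate, and latent features are hard-erased only inside the gated region. The normalization, the threshold `tau`, and zeroing as the "hard erasure" are assumptions for illustration; the paper's operators may differ.

```python
import numpy as np

def spatial_gate_erase(latent, attn_map, tau=0.5):
    """Illustrative stages 2-3: gate regions with residual concept activation,
    then hard-erase latent features only where the gate fires.

    latent:   (C, H, W) latent feature map
    attn_map: (H, W) cross-attention scores for the target concept token
    tau:      gating threshold in [0, 1] (assumed value)
    """
    # Normalize attention to [0, 1] so a single tau is comparable across images.
    a = (attn_map - attn_map.min()) / (attn_map.max() - attn_map.min() + 1e-8)
    gate = a > tau                 # True where residual activation remains
    out = latent.copy()
    out[..., gate] = 0.0           # hard erasure, applied only where necessary
    return out, gate
```

Restricting the hard erasure to the gated region is what keeps the edit localized: features outside the gate, including those supporting neighbor concepts, pass through unmodified.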

🛡️ Threat Analysis

Output Integrity Attack

Focuses on controlling and sanitizing the outputs of generative models by erasing specific concepts (explicit content, celebrity identities, artistic styles) from text-to-image diffusion models. This is fundamentally about output integrity and content control — ensuring models don't generate certain types of content. The paper addresses the problem of removing undesired concepts while maintaining generation quality for related concepts.


Details

Domains
vision, generative
Model Types
diffusion
Threat Tags
inference_time
Datasets
Oxford Flowers, Stanford Dogs
Applications
text-to-image generation, content moderation, concept removal