defense 2025

A Single Neuron Works: Precise Concept Erasure in Text-to-Image Diffusion Models

Qinqin He , Jiaqi Weng , Jialing Tao , Hui Xue

4 citations · 1 influential · 25 references · arXiv


Published on arXiv

2509.21008

Input Manipulation Attack

OWASP ML Top 10 — ML01

Key Finding

SNCE achieves state-of-the-art concept erasure (e.g., nudity, violence) while outperforming prior methods in robustness against adversarial attacks and preserving generation quality for non-target concepts

SNCE (Single Neuron-based Concept Erasure)

Novel technique introduced


Text-to-image models exhibit remarkable capabilities in image generation. However, they also pose safety risks of generating harmful content. A key challenge of existing concept erasure methods is the precise removal of target concepts while minimizing degradation of image quality. In this paper, we propose Single Neuron-based Concept Erasure (SNCE), a novel approach that can precisely prevent harmful content generation by manipulating only a single neuron. Specifically, we train a Sparse Autoencoder (SAE) to map text embeddings into a sparse, disentangled latent space, where individual neurons align tightly with atomic semantic concepts. To accurately locate neurons responsible for harmful concepts, we design a novel neuron identification method based on the modulated frequency scoring of activation patterns. By suppressing activations of the harmful concept-specific neuron, SNCE achieves surgical precision in concept erasure with minimal disruption to image quality. Experiments on various benchmarks demonstrate that SNCE achieves state-of-the-art results in target concept erasure, while preserving the model's generation capabilities for non-target concepts. Additionally, our method exhibits strong robustness against adversarial attacks, significantly outperforming existing methods.
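The intervention described in the abstract can be sketched roughly as follows: encode the text embedding with an SAE, zero one latent neuron, and decode. This is a minimal illustration, not the authors' implementation. The `TinySAE` class, its random untrained weights, the dimensions, and the choice to add only the decoded difference back onto the original embedding (so the SAE's reconstruction error is not injected) are all assumptions made for the example.

```python
import numpy as np


class TinySAE:
    """Hypothetical one-layer sparse autoencoder (random, untrained weights)."""

    def __init__(self, d_embed, d_latent, seed=0):
        rng = np.random.default_rng(seed)
        self.W_enc = rng.normal(scale=0.1, size=(d_embed, d_latent))
        self.b_enc = np.zeros(d_latent)
        self.W_dec = rng.normal(scale=0.1, size=(d_latent, d_embed))
        self.b_dec = np.zeros(d_embed)

    def encode(self, x):
        # ReLU keeps latent activations non-negative; a trained SAE would
        # also be pushed toward sparsity by its training loss.
        return np.maximum(x @ self.W_enc + self.b_enc, 0.0)

    def decode(self, z):
        return z @ self.W_dec + self.b_dec


def suppress_neuron(sae, text_emb, neuron_idx):
    """Zero one latent neuron and return the edited text embedding."""
    z = sae.encode(text_emb)
    z_edit = z.copy()
    z_edit[..., neuron_idx] = 0.0
    # Assumption: apply only the change caused by the edit, so tokens where
    # the neuron is inactive are left unchanged up to floating-point rounding.
    return text_emb + sae.decode(z_edit) - sae.decode(z)


# Usage with stand-in data: 77 tokens, 16-dimensional embeddings.
sae = TinySAE(d_embed=16, d_latent=64)
emb = np.random.default_rng(1).normal(size=(77, 16))
edited = suppress_neuron(sae, emb, neuron_idx=3)
```

In a real pipeline the edited embedding would be passed to the diffusion model in place of the original text-encoder output; that wiring is omitted here.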


Key Contributions

  • SNCE: a concept erasure method that trains a Sparse Autoencoder on text embeddings to map representations to a sparse, disentangled latent space where individual neurons correspond to atomic semantic concepts
  • A neuron identification method based on modulated frequency scoring of activation patterns, using contrastive concept pairs to filter spurious neurons
  • Demonstrates that suppressing a single concept-specific neuron achieves state-of-the-art erasure precision while preserving generation quality and exhibiting strong robustness against adversarial bypass attacks
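The neuron identification step listed above can be illustrated with a simple stand-in score. The summary does not give the paper's modulated frequency scoring formula, so the function below is an assumption: it multiplies how much more often a neuron fires on target-concept prompts than on contrastive prompts by the neuron's mean activation when it fires. The names `score_neurons`, `pick_neuron`, and the `threshold` parameter are hypothetical.

```python
import numpy as np


def score_neurons(z_target, z_contrast, threshold=0.0):
    """Score SAE neurons from latent activations of shape (n_prompts, d_latent).

    Illustrative stand-in, not the paper's formula.
    """
    fires_t = z_target > threshold
    freq_t = fires_t.mean(axis=0)                     # firing rate on target prompts
    freq_c = (z_contrast > threshold).mean(axis=0)    # firing rate on contrastive prompts
    # Mean activation over the target prompts where the neuron fires.
    mean_act = (z_target * fires_t).sum(axis=0) / np.maximum(fires_t.sum(axis=0), 1)
    # Neurons that fire just as often on contrastive prompts score zero,
    # which filters neurons tied to shared, non-target content.
    return np.clip(freq_t - freq_c, 0.0, None) * mean_act


def pick_neuron(z_target, z_contrast):
    """Return the index of the highest-scoring neuron."""
    return int(np.argmax(score_neurons(z_target, z_contrast)))


# Synthetic activations: neuron 0 fires everywhere (spurious),
# neuron 2 fires only on target-concept prompts.
z_t = np.array([[1.0, 0.0, 2.0, 0.0, 0.0]] * 4)
z_c = np.array([[1.0, 0.0, 0.0, 0.0, 0.0]] * 4)
best = pick_neuron(z_t, z_c)
```

The contrastive set plays the role the summary attributes to contrastive concept pairs: a prompt pair that differs only in the target concept lets shared neurons cancel out.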

🛡️ Threat Analysis

Input Manipulation Attack

The paper explicitly evaluates robustness against inference-time adversarial attacks, that is, prompts or inputs crafted to bypass the concept erasure mechanism. SNCE serves as the defense: by suppressing the concept-specific neuron, it aims to keep erasure effective under such input manipulation attacks targeting text-to-image (T2I) safety filters.


Details

Domains
vision, nlp, generative
Model Types
diffusion, transformer
Threat Tags
inference_time, white_box, black_box, digital
Datasets
I2P, COCO, Ring-A-Bell, UnlearnDiffAtk benchmark
Applications
text-to-image generation, content safety, harmful content prevention