defense 2026

ActErase: A Training-Free Paradigm for Precise Concept Erasure via Activation Patching

Yi Sun 1,2, Xinhao Zhong 1, Hongyan Li 1, Yimin Zhou 3, Junhao Li 1, Bin Chen 1,2, Xuan Wang 1,2

1 citation · 43 references · arXiv


Published on arXiv

2601.00267

Output Integrity Attack

OWASP ML Top 10 — ML09

Key Finding

Training-free activation patching achieves state-of-the-art concept erasure across nudity, artistic style, and object removal benchmarks while preserving overall generative capability and resisting adversarial attacks

ActErase

Novel technique introduced


Recent advances in text-to-image diffusion models have demonstrated remarkable generation capabilities, yet they raise significant safety, copyright, and ethical concerns. Existing concept erasure methods address these risks by removing sensitive concepts from pre-trained models, but most rely on data-intensive and computationally expensive fine-tuning, a critical limitation. To overcome these challenges, and inspired by the observation that a model's activations are predominantly composed of generic concepts, with only a minimal component representing the target concept, we propose ActErase, a novel training-free method for efficient concept erasure. The method identifies activation-difference regions via prompt-pair analysis, extracts target activations, and dynamically replaces input activations during forward passes. Comprehensive evaluations across three critical erasure tasks (nudity, artistic style, and object removal) demonstrate that our training-free method achieves state-of-the-art (SOTA) erasure performance while effectively preserving the model's overall generative capability. Our approach also exhibits strong robustness against adversarial attacks, establishing a new plug-and-play paradigm for lightweight yet effective concept manipulation in diffusion models.
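The three-step mechanism described in the abstract (prompt-pair analysis, difference-region identification, runtime replacement) can be sketched with PyTorch forward hooks. This is a minimal, hypothetical illustration, not the authors' implementation: the toy `model`, the stand-in prompt embeddings, the choice of layer, and the top-k sparsity `k` are all assumptions made for the example.

```python
# Hypothetical sketch of ActErase-style activation patching.
# All names here are illustrative assumptions, not the paper's code.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in for one block of a diffusion model's conditioned pathway.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 16))
layer = model[0]  # the layer whose activations we analyze and patch

captured = {}

def capture_hook(module, inputs, output):
    # Record this layer's activations during a forward pass.
    captured["act"] = output.detach()

# 1) Prompt-pair analysis: run a target-concept prompt and a neutral
#    prompt (stand-in embeddings here) and record layer activations.
h = layer.register_forward_hook(capture_hook)
target_emb = torch.randn(1, 16)   # e.g. embedding of "a nude portrait"
neutral_emb = torch.randn(1, 16)  # e.g. embedding of "a portrait"
model(target_emb)
act_target = captured["act"]
model(neutral_emb)
act_neutral = captured["act"]
h.remove()

# 2) Identify the activation-difference region: the small set of units
#    whose activations differ most between the prompt pair, consistent
#    with the paper's claim that the target concept occupies only a
#    minimal component of the activations.
diff = (act_target - act_neutral).abs().squeeze(0)
k = 4  # assumed sparsity of the concept region
concept_units = diff.topk(k).indices

# 3) Dynamic replacement: during inference, overwrite only those units
#    with the neutral activations, leaving everything else untouched.
def patch_hook(module, inputs, output):
    patched = output.clone()
    patched[:, concept_units] = act_neutral[:, concept_units]
    return patched

h = layer.register_forward_hook(patch_hook)
erased_out = model(target_emb)  # forward pass with the concept patched out
h.remove()
```

Because the patch is applied by a hook at inference time and removed afterwards, no weights change, which is what makes this style of erasure training-free and plug-and-play.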


Key Contributions

  • Training-free concept erasure method (ActErase) using activation patching — eliminates the need for expensive fine-tuning
  • Identifies activation difference regions via prompt-pair analysis to isolate minimal target-concept components in model activations
  • Achieves SOTA erasure performance across nudity, artistic style, and object removal tasks while preserving generative quality and demonstrating robustness to adversarial bypass attempts

🛡️ Threat Analysis

Output Integrity Attack

Concept erasure directly controls what content the diffusion model is permitted to output — nudity, copyrighted artistic styles, and specific objects — addressing output integrity by removing unsafe generation capabilities. The paper also explicitly evaluates robustness against adversarial attacks designed to bypass these output-level safety controls.


Details

Domains
vision · generative
Model Types
diffusion
Threat Tags
white_box · inference_time
Applications
text-to-image generation · content safety · copyright protection