defense 2026

ActErase: A Training-Free Paradigm for Precise Concept Erasure via Activation Patching

Yi Sun 1,2, Xinhao Zhong 1, Hongyan Li 1, Yimin Zhou 3, Junhao Li 1, Bin Chen 1,2, Xuan Wang 1,2

1 citation · 43 references · arXiv


Published on arXiv

2601.00267

Output Integrity Attack

OWASP ML Top 10 — ML09

Key Finding

Training-free activation patching achieves state-of-the-art concept erasure across nudity, artistic style, and object removal benchmarks while preserving overall generative capability and resisting adversarial attacks

ActErase

Novel technique introduced


Recent advances in text-to-image diffusion models have demonstrated remarkable generation capabilities, yet they raise significant safety, copyright, and ethical concerns. Existing concept erasure methods address these risks by removing sensitive concepts from pre-trained models, but most rely on data-intensive and computationally expensive fine-tuning, a critical limitation. To overcome these challenges, and inspired by the observation that a model's activations are predominantly composed of generic concepts, with only a minimal component representing the target concept, we propose ActErase, a novel training-free method for efficient concept erasure. The method identifies activation-difference regions via prompt-pair analysis, extracts target activations, and dynamically replaces input activations during forward passes. Comprehensive evaluations across three critical erasure tasks (nudity, artistic style, and object removal) demonstrate that our training-free method achieves state-of-the-art (SOTA) erasure performance while effectively preserving the model's overall generative capability. Our approach also exhibits strong robustness against adversarial attacks, establishing a new plug-and-play paradigm for lightweight yet effective concept manipulation in diffusion models.
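The three-step mechanism described in the abstract (prompt-pair analysis, difference-region identification, runtime replacement) can be sketched with PyTorch forward hooks. This is a minimal, hypothetical illustration, not the authors' implementation: the toy `model`, the stand-in prompt embeddings, the choice of layer, and the top-k sparsity `k` are all assumptions made for the example.

```python
# Hypothetical sketch of ActErase-style activation patching.
# All names here are illustrative assumptions, not the paper's code.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in for one block of a diffusion model's conditioned pathway.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 16))
layer = model[0]  # the layer whose activations we analyze and patch

captured = {}

def capture_hook(module, inputs, output):
    # Record this layer's activations during a forward pass.
    captured["act"] = output.detach()

# 1) Prompt-pair analysis: run a target-concept prompt and a neutral
#    prompt (stand-in embeddings here) and record layer activations.
h = layer.register_forward_hook(capture_hook)
target_emb = torch.randn(1, 16)   # e.g. embedding of "a nude portrait"
neutral_emb = torch.randn(1, 16)  # e.g. embedding of "a portrait"
model(target_emb)
act_target = captured["act"]
model(neutral_emb)
act_neutral = captured["act"]
h.remove()

# 2) Identify the activation-difference region: the small set of units
#    whose activations differ most between the prompt pair, consistent
#    with the paper's claim that the target concept occupies only a
#    minimal component of the activations.
diff = (act_target - act_neutral).abs().squeeze(0)
k = 4  # assumed sparsity of the concept region
concept_units = diff.topk(k).indices

# 3) Dynamic replacement: during inference, overwrite only those units
#    with the neutral activations, leaving everything else untouched.
def patch_hook(module, inputs, output):
    patched = output.clone()
    patched[:, concept_units] = act_neutral[:, concept_units]
    return patched

h = layer.register_forward_hook(patch_hook)
erased_out = model(target_emb)  # forward pass with the concept patched out
h.remove()
```

Because the patch is applied by a hook at inference time and removed afterwards, no weights change, which is what makes this style of erasure training-free and plug-and-play.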


Key Contributions

  • Training-free concept erasure method (ActErase) using activation patching — eliminates the need for expensive fine-tuning
  • Identifies activation difference regions via prompt-pair analysis to isolate minimal target-concept components in model activations
  • Achieves SOTA erasure performance across nudity, artistic style, and object removal tasks while preserving generative quality and demonstrating robustness to adversarial bypass attempts

🛡️ Threat Analysis

Output Integrity Attack

Concept erasure directly controls what content the diffusion model is permitted to output — nudity, copyrighted artistic styles, and specific objects — addressing output integrity by removing unsafe generation capabilities. The paper also explicitly evaluates robustness against adversarial attacks designed to bypass these output-level safety controls.


Details

Domains
vision · generative
Model Types
diffusion
Threat Tags
white_box · inference_time
Applications
text-to-image generation · content safety · copyright protection