Defense · 2025

Bi-Erasing: A Bidirectional Framework for Concept Removal in Diffusion Models

Hao Chen 1, Yiwei Wang 2, Songze Li 1

0 citations · 33 references · arXiv


Published on arXiv · 2512.13039

Output Integrity Attack

OWASP ML Top 10 — ML09

Key Finding

Bi-Erasing outperforms baseline concept erasure methods in balancing removal effectiveness and visual fidelity by jointly optimizing suppression and safety-enhancement directions.

Bi-Erasing (Bidirectional Image-Guided Concept Erasure)

Novel technique introduced


Concept erasure, which fine-tunes diffusion models to remove undesired or harmful visual concepts, has become a mainstream approach to mitigating unsafe or illegal image generation in text-to-image models. However, existing removal methods typically adopt a unidirectional erasure strategy, either suppressing the target concept or reinforcing safe alternatives, making it difficult to achieve a balanced trade-off between concept removal and generation quality. To address this limitation, we propose a novel Bidirectional Image-Guided Concept Erasure (Bi-Erasing) framework that performs concept suppression and safety enhancement simultaneously. Specifically, based on the joint representation of text prompts and corresponding images, Bi-Erasing introduces two decoupled image branches: a negative branch responsible for suppressing harmful semantics and a positive branch providing visual guidance for safe alternatives. By jointly optimizing these complementary directions, our approach achieves a balance between erasure efficacy and generation usability. In addition, we apply mask-based filtering to the image branches to prevent interference from irrelevant content during the erasure process. In extensive experimental evaluations, the proposed Bi-Erasing outperforms baseline methods in balancing concept removal effectiveness and visual fidelity.


Key Contributions

  • Bidirectional Image-Guided Concept Erasure (Bi-Erasing) framework that simultaneously suppresses harmful semantics (negative branch) and reinforces safe visual alternatives (positive branch)
  • Joint representation of text prompts and corresponding images to guide concept erasure using decoupled image branches
  • Mask-based filtering to prevent interference from irrelevant content during the erasure process
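The contributions above can be illustrated with a minimal numerical sketch. The paper itself does not publish this formula; the code below is a hypothetical reading of the bidirectional objective, in which a predicted noise residual is pushed away from a negative (harmful) guidance direction and pulled toward a positive (safe) one, with a binary mask standing in for the paper's mask-based filtering. The function name `bi_erasing_loss`, the weights `lam_neg`/`lam_pos`, and the combined-target form are all assumptions for illustration.

```python
import numpy as np

def bi_erasing_loss(eps_pred, eps_neg, eps_pos, mask,
                    lam_neg=1.0, lam_pos=1.0):
    # Hypothetical combined target: move away from the harmful direction
    # (negative branch) while moving toward the safe alternative
    # (positive branch).
    target = -lam_neg * eps_neg + lam_pos * eps_pos
    # Mask-based filtering: only concept-relevant regions contribute,
    # so irrelevant content does not interfere with the erasure signal.
    diff = mask * (eps_pred - target)
    return float(np.mean(diff ** 2))

rng = np.random.default_rng(0)
shape = (4, 64, 64)
eps_neg = rng.standard_normal(shape)   # guidance from the negative image branch
eps_pos = rng.standard_normal(shape)   # guidance from the positive image branch
mask = (rng.random(shape) > 0.5).astype(float)

# A prediction that already matches the bidirectional target incurs zero loss.
ideal = -eps_neg + eps_pos
print(bi_erasing_loss(ideal, eps_neg, eps_pos, mask))  # → 0.0
```

Jointly minimizing both directions in one objective, rather than only suppressing the target concept, is what the abstract credits for the improved trade-off between erasure efficacy and generation usability.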

🛡️ Threat Analysis

Output Integrity Attack

Concept erasure is a defense ensuring the integrity of generative model outputs — specifically preventing diffusion models from generating harmful, unsafe, or illegal visual content. The paper's primary contribution is a method to modify the model's behavior so its outputs do not contain undesired concepts, directly targeting output integrity of AI-generated content.


Details

Domains
vision, generative
Model Types
diffusion
Threat Tags
training_time
Applications
text-to-image generation, content moderation, safe image generation