M-ErasureBench: A Comprehensive Multimodal Evaluation Benchmark for Concept Erasure in Diffusion Models
Ju-Hsuan Weng 1,2, Jia-Wei Liao 1,2, Cheng-Fu Chou 1, Jun-Cheng Chen 2
Published on arXiv: 2512.22877
Input Manipulation Attack
OWASP ML Top 10 — ML01
Key Finding
Existing concept erasure methods still allow a Concept Reproduction Rate (CRR) above 90% under white-box learned-embedding and latent-inversion attacks; IRECE reduces CRR by up to 40% in the white-box latent-inversion scenario while preserving visual quality.
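This summary does not define how CRR is computed, nor whether "up to 40%" is an absolute or a relative reduction. The sketch below assumes CRR is the fraction of generated images in which a detector still finds the erased concept, and shows how the two readings of the reduction differ. The function name and the detector outputs are hypothetical, chosen only for illustration.

```python
from typing import Sequence


def concept_reproduction_rate(detected: Sequence[bool]) -> float:
    """Assumed definition: share of generated images where the erased concept is detected."""
    if not detected:
        raise ValueError("need at least one generated image")
    return sum(detected) / len(detected)


# Hypothetical detector outputs for 10 generated images each (not data from the paper).
without_defense = [True] * 9 + [False]
with_defense = [True] * 5 + [False] * 5

crr_before = concept_reproduction_rate(without_defense)
crr_after = concept_reproduction_rate(with_defense)

# Two ways to read "reduces CRR by X%".
absolute_drop = crr_before - crr_after        # difference in percentage points
relative_drop = absolute_drop / crr_before    # fraction of the original CRR removed

print(f"CRR before: {crr_before:.2f}, after: {crr_after:.2f}")
print(f"absolute drop: {absolute_drop:.2f}, relative drop: {relative_drop:.2f}")
```

In this toy case the absolute drop is 40 percentage points while the relative drop is about 44%, so the distinction matters when comparing reported numbers.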
M-ErasureBench / IRECE
Novel technique introduced
Text-to-image diffusion models may generate harmful or copyrighted content, motivating research on concept erasure. However, existing approaches primarily focus on erasing concepts from text prompts, overlooking other input modalities that are increasingly critical in real-world applications such as image editing and personalized generation. These modalities can become attack surfaces, where erased concepts re-emerge despite defenses. To bridge this gap, we introduce M-ErasureBench, a novel multimodal evaluation framework that systematically benchmarks concept erasure methods across three input modalities: text prompts, learned embeddings, and inverted latents. For the latter two, we evaluate both white-box and black-box access, yielding five evaluation scenarios. Our analysis shows that existing methods achieve strong erasure performance against text prompts but largely fail under learned embeddings and inverted latents, with Concept Reproduction Rate (CRR) exceeding 90% in the white-box setting. To address these vulnerabilities, we propose IRECE (Inference-time Robustness Enhancement for Concept Erasure), a plug-and-play module that localizes target concepts via cross-attention and perturbs the associated latents during denoising. Experiments demonstrate that IRECE consistently restores robustness, reducing CRR by up to 40% under the most challenging white-box latent inversion scenario, while preserving visual quality. To the best of our knowledge, M-ErasureBench provides the first comprehensive benchmark of concept erasure beyond text prompts. Together with IRECE, our benchmark offers practical safeguards for building more reliable protective generative models.
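The count of five scenarios follows from the abstract: text prompts form one scenario, while learned embeddings and inverted latents are each evaluated under white-box and black-box access. A minimal sketch of that enumeration, with identifier names that are illustrative rather than taken from the benchmark:

```python
from itertools import product

# Labels are hypothetical; the abstract only names the modalities and access levels.
MODALITIES = ("text_prompt", "learned_embedding", "inverted_latent")
ACCESS_LEVELS = ("white_box", "black_box")


def evaluation_scenarios():
    """Text prompts have no access-level split; the other two modalities get both."""
    scenarios = [("text_prompt", None)]
    scenarios += list(product(MODALITIES[1:], ACCESS_LEVELS))
    return scenarios


for modality, access in evaluation_scenarios():
    print(modality, access or "n/a")
```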
Key Contributions
- M-ErasureBench: the first multimodal benchmark evaluating concept erasure robustness across text prompts, learned embeddings, and inverted latents, with both white-box and black-box access for the latter two (five evaluation scenarios in total)
- Empirical finding that existing concept erasure methods largely fail under learned embedding and latent inversion attacks, with Concept Reproduction Rate (CRR) exceeding 90% in white-box settings
- IRECE: a plug-and-play inference-time defense that localizes target concepts via cross-attention and perturbs the associated latents during denoising, reducing CRR by up to 40% in the most challenging scenario, white-box latent inversion
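The "localize, then perturb" idea behind IRECE can be illustrated with a toy example. This is a sketch under stated assumptions, not the authors' implementation: the summary does not specify how the cross-attention map is thresholded, what kind of perturbation is applied, or at which denoising steps. Here a quantile threshold and additive Gaussian noise are assumed, and a 4x4 grid of floats stands in for one latent channel at a single step. All function and parameter names are hypothetical.

```python
import random


def concept_mask(attn, quantile=0.75):
    """Assumed rule: flag positions whose attention score exceeds the given quantile."""
    flat = sorted(v for row in attn for v in row)
    threshold = flat[int(quantile * (len(flat) - 1))]
    return [[v > threshold for v in row] for row in attn]


def perturb_latent(latent, mask, scale=1.0, rng=None):
    """Assumed perturbation: add Gaussian noise only where the mask is set."""
    rng = rng or random.Random(0)
    return [
        [x + scale * rng.gauss(0.0, 1.0) if m else x for x, m in zip(lrow, mrow)]
        for lrow, mrow in zip(latent, mask)
    ]


# Toy cross-attention map for the target concept token:
# attention concentrates in the top-left 2x2 block.
attn = [
    [0.9, 0.8, 0.1, 0.0],
    [0.7, 0.6, 0.1, 0.0],
    [0.1, 0.1, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
latent = [[0.0] * 4 for _ in range(4)]

mask = concept_mask(attn)
new_latent = perturb_latent(latent, mask)

for row in new_latent:
    print(["%.2f" % v for v in row])
```

Only the four high-attention positions change, which mirrors the stated goal of disturbing the concept region while leaving the rest of the image, and hence its visual quality, largely intact. In a real pipeline a step like this would run inside the denoising loop on the model's actual latents and attention maps.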
🛡️ Threat Analysis
The paper studies adversarial inputs in non-text modalities (learned embeddings via textual inversion, inverted latents via DDIM inversion) that are crafted at inference time to circumvent concept erasure safety mechanisms in diffusion models. This is analogous to adversarial suffix optimization that bypasses safety filters. IRECE is a defense against these inference-time input manipulation attacks.