
CGCE: Classifier-Guided Concept Erasure in Generative Models

Viet Nguyen, Vishal M. Patel

0 citations · 43 references · arXiv


Published on arXiv · 2511.05865

Input Manipulation Attack (OWASP ML Top 10: ML01)

Prompt Injection (OWASP LLM Top 10: LLM01)

Key Finding

CGCE achieves state-of-the-art robustness against diverse red-teaming attacks on T2I and T2V models while maintaining superior generative quality on benign prompts, without altering original model weights.

CGCE (Classifier-Guided Concept Erasure)

Novel technique introduced


Recent advancements in large-scale generative models have enabled the creation of high-quality images and videos, but have also raised significant safety concerns regarding the generation of unsafe content. To mitigate this, concept erasure methods have been developed to remove undesirable concepts from pre-trained models. However, existing methods remain vulnerable to adversarial attacks that can regenerate the erased content. Moreover, achieving robust erasure often degrades the model's generative quality for safe, unrelated concepts, creating a difficult trade-off between safety and performance. To address this challenge, we introduce Classifier-Guided Concept Erasure (CGCE), an efficient plug-and-play framework that provides robust concept erasure for diverse generative models without altering their original weights. CGCE uses a lightweight classifier operating on text embeddings to first detect and then refine prompts containing undesired concepts. This approach is highly scalable, allowing for multi-concept erasure by aggregating guidance from several classifiers. By modifying only unsafe embeddings at inference time, our method prevents harmful content generation while preserving the model's original quality on benign prompts. Extensive experiments show that CGCE achieves state-of-the-art robustness against a wide range of red-teaming attacks. Our approach also maintains high generative utility, demonstrating a superior balance between safety and performance. We showcase the versatility of CGCE through its successful application to various modern T2I and T2V models, establishing it as a practical and effective solution for safe generative AI.


Key Contributions

  • CGCE: a plug-and-play concept erasure framework that operates on text embeddings at inference time without modifying model weights, enabling safe deployment across diverse T2I and T2V models
  • Lightweight classifier-based prompt detection and refinement pipeline that is scalable to multi-concept erasure by aggregating guidance from multiple classifiers
  • State-of-the-art robustness against a wide range of red-teaming/adversarial attacks while preserving generative quality on benign prompts
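The detect-then-refine idea behind these contributions can be sketched in code. The following is a minimal, hypothetical illustration, not the paper's implementation: the `ConceptClassifier` architecture, the mean-pooling, the hinge loss, the optimizer, and all hyperparameters are assumptions. It shows the two pieces the abstract describes: a lightweight classifier scoring text embeddings for an undesired concept, and inference-time refinement of only the flagged embeddings, with guidance aggregated across multiple classifiers for multi-concept erasure.

```python
import torch
import torch.nn as nn


class ConceptClassifier(nn.Module):
    """Hypothetical lightweight head that scores pooled text embeddings
    for the presence of one undesired concept (logit > 0 means unsafe)."""

    def __init__(self, embed_dim: int = 768):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(embed_dim, 256), nn.ReLU(), nn.Linear(256, 1)
        )

    def forward(self, emb: torch.Tensor) -> torch.Tensor:
        # emb: (batch, seq_len, embed_dim) -> one unsafe logit per prompt
        return self.head(emb.mean(dim=1)).squeeze(-1)


def erase_concepts(emb, classifiers, steps=10, lr=0.05, threshold=0.0):
    """Detect-then-refine sketch: prompts flagged by any classifier have
    their embeddings nudged by gradient descent on the aggregated unsafe
    logits; benign embeddings pass through completely untouched."""
    with torch.no_grad():
        # A prompt is flagged if any concept classifier fires on it.
        flagged = torch.stack([c(emb) for c in classifiers]).amax(dim=0) > threshold
    if not flagged.any():
        return emb  # benign prompt: leave the embedding unchanged

    refined = emb.detach().clone().requires_grad_(True)
    opt = torch.optim.Adam([refined], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        # Aggregate guidance from all classifiers (multi-concept erasure);
        # the hinge pushes each flagged prompt's unsafe logit below zero.
        loss = sum(torch.relu(c(refined))[flagged].sum() for c in classifiers)
        loss.backward()
        opt.step()

    out = emb.detach().clone()
    out[flagged] = refined.detach()[flagged]  # modify only unsafe embeddings
    return out
```

Because only flagged embeddings are rewritten, the generator's weights and its behavior on benign prompts are untouched, which mirrors the plug-and-play property claimed above.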

🛡️ Threat Analysis

Input Manipulation Attack

The paper's primary threat model is inference-time manipulation of text inputs, including gradient-based adversarial prompts and red-teaming attacks, aimed at regenerating erased concepts. CGCE defends by classifying and refining text embeddings before they reach the generative model, making it a defense against adversarial input manipulation of generative models.
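To make the defense's position in the pipeline concrete, here is a self-contained toy sketch of an inference-time filter that would sit between the text encoder and the generator. As a simpler stand-in for the paper's learned classifiers, it detects each erased concept by a dot product with an assumed unit "concept direction" and projects that direction out of flagged embeddings; `cgce_filter`, `concept_dirs`, and the threshold are all hypothetical illustration.

```python
import torch


def cgce_filter(emb, concept_dirs, threshold=0.5):
    """Toy inference-time filter between text encoder and generator.

    emb:          (batch, seq_len, dim) text embeddings
    concept_dirs: list of unit vectors, one per erased concept (a simple
                  stand-in for learned classifiers)
    Flagged prompts have the concept component projected out of every
    token embedding; benign prompts pass through unchanged.
    """
    pooled = emb.mean(dim=1)  # (batch, dim) prompt-level summary
    out = emb.clone()
    for d in concept_dirs:
        score = pooled @ d            # detection score per prompt
        mask = score > threshold      # which prompts contain the concept
        # Remove the concept direction from every token of flagged prompts.
        proj = (out[mask] @ d).unsqueeze(-1) * d
        out[mask] = out[mask] - proj
    return out
```

Since the original model weights are never touched, an attacker who crafts adversarial prompt text still has to pass through this embedding-level check at every generation call, which is the inference-time defense posture described above.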


Details

Domains: vision, generative, multimodal
Model Types: diffusion, vlm
Threat Tags: white_box, black_box, inference_time
Datasets: I2P, COCO
Applications: text-to-image generation, text-to-video generation, content safety, NSFW filtering, artistic style removal