Defense · 2025

Now You See It, Now You Don't - Instant Concept Erasure for Safe Text-to-Image and Video Generation

Shristi Das Biswas, Arani Roy, Kaushik Roy

1 citation · 81 references · arXiv


Published on arXiv · 2511.18684

Output Integrity Attack

OWASP ML Top 10 — ML09

Key Finding

ICE achieves strong concept erasure across four unlearning axes (artistic styles, objects, identities, and explicit content), with improved robustness to adversarial red-teaming and minimal collateral degradation; the edit is applied in roughly 2 seconds, with no retraining and no inference overhead.

ICE (Instant Concept Erasure)

Novel technique introduced


Robust concept removal for text-to-image (T2I) and text-to-video (T2V) models is essential for their safe deployment. Existing methods, however, suffer from costly retraining, inference overhead, or vulnerability to adversarial attacks. Crucially, they rarely model the latent semantic overlap between the target erase concept and surrounding content -- causing collateral damage post-erasure -- and even fewer methods work reliably across both T2I and T2V domains. We introduce Instant Concept Erasure (ICE), a training-free, modality-agnostic, one-shot weight modification approach that achieves precise, persistent unlearning with zero overhead. ICE defines erase and preserve subspaces using anisotropic energy-weighted scaling, then explicitly regularises against their intersection using a unique, closed-form overlap projector. We pose a convex and Lipschitz-bounded Spectral Unlearning Objective, balancing erasure fidelity and intersection preservation, that admits a stable and unique analytical solution. This solution defines a dissociation operator that is translated to the model's text-conditioning layers, making the edit permanent and runtime-free. Across targeted removals of artistic styles, objects, identities, and explicit content, ICE efficiently achieves strong erasure with improved robustness to red-teaming, all while causing only minimal degradation of original generative abilities in both T2I and T2V models.


Key Contributions

  • Training-free, one-shot weight modification (ICE) that permanently erases target concepts from T2I and T2V diffusion models in ~2 seconds with no inference overhead
  • Closed-form Spectral Unlearning Objective with anisotropic energy-weighted subspace operators and an explicit overlap projector that minimizes collateral damage to semantically related preserved concepts
  • Modality-agnostic framework demonstrated across artistic styles, object categories, identities, and explicit content with improved robustness to adversarial red-teaming attacks
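The closed-form edit described in the contributions can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's exact operator: the function names `energy_basis` and `dissociation_operator`, the 0.95 energy threshold, and the simple `P_p @ P_e` overlap term are placeholders standing in for ICE's anisotropic energy-weighted subspace operators and its unique closed-form overlap projector.

```python
import numpy as np

def energy_basis(X, energy=0.95):
    """Orthonormal basis for the column space of X retaining `energy`
    fraction of spectral mass (a crude stand-in for anisotropic
    energy-weighted scaling)."""
    U, S, _ = np.linalg.svd(X, full_matrices=False)
    k = int(np.searchsorted(np.cumsum(S**2) / S.dot(S), energy)) + 1
    return U[:, :k]

def dissociation_operator(erase_emb, preserve_emb, energy=0.95):
    """Build a one-shot dissociation operator D so that W_new = W @ D
    removes the erase subspace from a text-conditioning weight W,
    except where it overlaps preserved semantics.

    erase_emb, preserve_emb: (d, n) matrices whose columns are text
    embeddings of the erase / preserve concepts (hypothetical inputs)."""
    d = erase_emb.shape[0]
    B_e = energy_basis(erase_emb, energy)     # erase subspace basis
    B_p = energy_basis(preserve_emb, energy)  # preserve subspace basis
    P_e = B_e @ B_e.T                         # projector onto erase subspace
    P_p = B_p @ B_p.T                         # projector onto preserve subspace
    # Coarse overlap term: the part of the erase subspace that also lies
    # in the preserve subspace (the paper derives an exact closed-form
    # overlap projector instead).
    P_overlap = P_p @ P_e
    # Remove erase directions, but add back their overlap with the
    # preserved content to limit collateral damage.
    return np.eye(d) - P_e + P_overlap @ P_e

# One-shot, training-free edit of a text-conditioning layer:
# W_new = W @ dissociation_operator(erase_emb, preserve_emb)
```

Because the operator is folded directly into the weight matrix, the edit is permanent and incurs zero runtime cost, which matches the "no inference overhead" property claimed above.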

🛡️ Threat Analysis

Output Integrity Attack

The paper's primary goal is ensuring the output integrity of generative AI models: preventing them from producing explicit content, unauthorized identities, and copyrighted styles. The threat model explicitly includes adversarial red-teaming (adversarial prompts attempting to bypass the safety mechanism), and the defense is evaluated against such attacks. This is a content-safety / AI output-integrity defense for diffusion-based generative models.


Details

Domains
vision · generative
Model Types
diffusion · transformer
Threat Tags
white_box · inference_time · digital
Applications
text-to-image generation · text-to-video generation · content safety