Defense · 2026

Localized Concept Erasure in Text-to-Image Diffusion Models via High-Level Representation Misdirection

Uichan Lee, Jeonghyeon Kim, Sangheum Hwang

0 citations · 51 references · arXiv (Cornell University)


Published on arXiv: 2602.19631

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

HiRM achieves strong, balanced concept erasure across style, object, and NSFW targets on UnlearnCanvas and NSFW benchmarks while remaining robust against adversarial prompt attacks, and transfers to Flux without retraining.

HiRM (High-Level Representation Misdirection)

Novel technique introduced


Text-to-image (T2I) diffusion models have advanced rapidly and been widely adopted. However, their powerful generative capabilities raise concerns about potential misuse for synthesizing harmful, private, or copyrighted content. To mitigate such risks, concept erasure techniques have emerged as a promising solution. Prior work has primarily focused on fine-tuning the denoising component (e.g., the U-Net backbone). However, recent causal tracing studies suggest that visual attribute information is localized in the early self-attention layers of the text encoder, pointing to an alternative site for concept erasure. Building on this insight, we conduct preliminary experiments and find that directly fine-tuning these early layers can suppress target concepts but often degrades generation quality for non-target concepts. To overcome this limitation, we propose High-Level Representation Misdirection (HiRM), which misdirects the high-level semantic representations of target concepts in the text encoder toward designated vectors, such as random directions or semantically defined directions (e.g., supercategories), while updating only the early layers that contain the causal states of visual attributes. This decoupling strategy enables precise concept removal with minimal impact on unrelated concepts, as demonstrated by strong results on the UnlearnCanvas and NSFW benchmarks across diverse targets (e.g., objects, styles, nudity). HiRM also preserves generative utility at low training cost, transfers to state-of-the-art architectures such as Flux without additional training, and shows synergistic effects with denoiser-based concept erasure methods.
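The core decoupling idea can be illustrated with a deliberately simplified sketch: treat the text encoder as an "early" trainable block followed by frozen "later" blocks, and optimize only the early weights so that the *final-layer* representation of a target-concept prompt moves toward a designated anchor vector. Everything below is a toy linear stand-in under stated assumptions, not the paper's implementation: `W1`, `W2`, `x_target`, and `anchor` are hypothetical placeholders for the early CLIP layers, the frozen remainder of the encoder, a target-prompt embedding, and a misdirection target (e.g., a supercategory direction).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear stand-in for HiRM's decoupling (NOT the paper's implementation):
# W1 plays the role of the early text-encoder layers (trainable, where causal
# visual-attribute states live); W2 plays the frozen later layers.
d = 8
W1 = rng.normal(size=(d, d)) / np.sqrt(d)  # early layers: the only weights updated
W2 = rng.normal(size=(d, d)) / np.sqrt(d)  # later layers: frozen

x_target = rng.normal(size=d)  # stand-in embedding of a target-concept prompt
anchor = rng.normal(size=d)    # designated direction (e.g., a supercategory)
anchor /= np.linalg.norm(anchor)

def encode(W1, x):
    """Final-layer representation: early block W1, then frozen block W2."""
    return W2 @ (W1 @ x)

lr, steps = 0.002, 500
dist_before = np.linalg.norm(encode(W1, x_target) - anchor)
for _ in range(steps):
    h = encode(W1, x_target)
    # Misdirection loss L = ||h - anchor||^2, with the gradient taken w.r.t.
    # W1 only, so the later layers stay untouched.
    grad_W1 = 2.0 * np.outer(W2.T @ (h - anchor), x_target)
    W1 -= lr * grad_W1
dist_after = np.linalg.norm(encode(W1, x_target) - anchor)
```

After training, `dist_after` is smaller than `dist_before`: the high-level representation of the target prompt has been redirected toward the anchor even though only the early block was updated. The paper additionally constrains the update so that non-target prompts are preserved, which this minimal sketch omits.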


Key Contributions

  • HiRM decouples the update location (early CLIP layers containing causal visual attributes) from the erasure target (final-layer high-level semantics), enabling precise concept removal with minimal collateral degradation.
  • Model-agnostic design: modifications are confined to the shared text encoder, allowing zero-shot transfer to new architectures such as Flux without additional fine-tuning.
  • Demonstrates synergistic compatibility with denoiser-based concept erasure methods, improving robustness against adversarial prompt attacks (Ring-A-Bell, UnLearnDiffAttack, MMA-Diffusion).

🛡️ Threat Analysis


Details

Domains
generative, vision
Model Types
diffusion, transformer, vlm
Threat Tags
training_time, inference_time, black_box
Datasets
UnlearnCanvas, I2P, Ring-A-Bell benchmark, MMA-Diffusion benchmark
Applications
text-to-image generation, NSFW content filtering, copyright/style erasure