
SafeRedir: Prompt Embedding Redirection for Robust Unlearning in Image Generation Models

Renyang Liu 1, Kangjie Chen 2, Han Qiu 3, Jie Zhang 4, Kwok-Yan Lam 2, Tianwei Zhang 2, See-Kiong Ng 1

1 citation · 53 references · arXiv


Published on arXiv · 2601.08623

Input Manipulation Attack

OWASP ML Top 10 — ML01

Key Finding

SafeRedir achieves effective concept unlearning with enhanced resistance to adversarial prompt attacks and prompt paraphrasing while preserving image quality, and generalizes plug-and-play across multiple diffusion backbones without modifying the model.

SafeRedir

Novel technique introduced


Image generation models (IGMs), while capable of producing impressive and creative content, often memorize a wide range of undesirable concepts from their training data, leading to the reproduction of unsafe content such as NSFW imagery and copyrighted artistic styles. Such behaviors pose persistent safety and compliance risks in real-world deployments and cannot be reliably mitigated by post-hoc filtering, owing to the limited robustness of such mechanisms and a lack of fine-grained semantic control. Recent unlearning methods seek to erase harmful concepts at the model level, but they exhibit limitations: they require costly retraining, degrade the quality of benign generations, or fail to withstand prompt paraphrasing and adversarial attacks. To address these challenges, we introduce SafeRedir, a lightweight inference-time framework for robust unlearning via prompt embedding redirection. Without modifying the underlying IGMs, SafeRedir adaptively routes unsafe prompts toward safe semantic regions through token-level interventions in the embedding space. The framework comprises two core components: a latent-aware multi-modal safety classifier for identifying unsafe generation trajectories, and a token-level delta generator for precise semantic redirection, equipped with auxiliary predictors for token masking and adaptive scaling to localize and regulate the intervention. Empirical results across multiple representative unlearning tasks demonstrate that SafeRedir achieves effective unlearning capability, high semantic and perceptual preservation, robust image quality, and enhanced resistance to adversarial attacks. Furthermore, SafeRedir generalizes effectively across a variety of diffusion backbones and existing unlearned models, validating its plug-and-play compatibility and broad applicability. Code and data are available at https://github.com/ryliu68/SafeRedir.


Key Contributions

  • Lightweight inference-time framework (SafeRedir) that redirects unsafe prompts via token-level embedding interventions without modifying the underlying image generation model weights
  • Latent-aware multi-modal safety classifier for identifying unsafe generation trajectories before image synthesis
  • Token-level delta generator with auxiliary predictors for adaptive token masking and scaling to achieve precise semantic redirection toward safe regions
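The redirection mechanism described above can be sketched as a small function. This is a hypothetical simplification: the names `redirect_embeddings`, `delta_fn`, `mask_fn`, and `scale_fn` are illustrative stand-ins for the paper's learned delta generator and its auxiliary masking/scaling predictors, not the actual SafeRedir API.

```python
# Hypothetical sketch of token-level embedding redirection.
# In SafeRedir, delta_fn / mask_fn / scale_fn would be learned
# networks; here they are plain callables for illustration.

def redirect_embeddings(embeddings, is_unsafe, delta_fn, mask_fn, scale_fn):
    """Redirect per-token prompt embeddings toward a safe semantic region.

    embeddings: list of per-token embedding vectors (list[list[float]])
    is_unsafe:  verdict from a safety classifier on the generation trajectory
    delta_fn:   per-token redirection vector (the delta generator)
    mask_fn:    per-token 0/1 weight localizing the intervention
    scale_fn:   per-token scalar regulating intervention strength
    """
    if not is_unsafe:
        return embeddings  # benign prompts pass through unchanged
    redirected = []
    for tok in embeddings:
        m = mask_fn(tok)   # which tokens to intervene on
        s = scale_fn(tok)  # adaptive magnitude of the intervention
        d = delta_fn(tok)  # direction toward the safe region
        redirected.append([e + m * s * di for e, di in zip(tok, d)])
    return redirected
```

The key property this illustrates is that the intervention is purely additive in embedding space, so the frozen diffusion backbone never needs to be retrained or modified.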

🛡️ Threat Analysis

Input Manipulation Attack

SafeRedir is a defense against adversarial prompt attacks — inputs crafted (via paraphrasing or gradient-based optimization) to bypass safety mechanisms in image generation models at inference time. It functions as an input-side purification/redirection defense, intercepting unsafe prompts in embedding space before generation, and is explicitly evaluated for robustness against adversarial attacks. This fits the 'input purification / adversarial detection' defense paradigm under ML01.
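The input-side purification flow described above can be sketched as a thin wrapper around a frozen generator. The function and parameter names (`safe_generate`, `encode`, `classify`, `redirect`, `generate`) are assumptions for illustration, not the repository's actual interface.

```python
# Hypothetical sketch of the plug-and-play inference-time guard:
# intercept the prompt in embedding space, redirect only if flagged,
# and leave the underlying image generation model untouched.

def safe_generate(prompt, encode, classify, redirect, generate):
    emb = encode(prompt)       # text encoder of the frozen IGM
    if classify(prompt, emb):  # latent-aware multi-modal safety classifier
        emb = redirect(emb)    # token-level embedding redirection
    return generate(emb)       # frozen diffusion backbone
```

Because the wrapper operates only on prompt embeddings, the same guard can be placed in front of different diffusion backbones or already-unlearned models, which is the plug-and-play property the paper evaluates.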


Details

Domains
vision · generative
Model Types
diffusion · transformer
Threat Tags
inference_time · white_box · black_box
Applications
image generation · nsfw content prevention · copyright artistic style protection