defense 2025

SafeGuider: Robust and Practical Content Safety Control for Text-to-Image Models

Peigui Qi 1, Kunsheng Tang 1, Wenbo Zhou 1, Weiming Zhang 1, Nenghai Yu 1, Tianwei Zhang 2, Qing Guo 3, Jie Zhang 3

1 citation · 48 references · CCS

Published on arXiv · 2510.05173

Input Manipulation Attack

OWASP ML Top 10 — ML01

Key Finding

SafeGuider reduces adversarial prompt attack success rates to a maximum of 5.48% while generating safe, meaningful images instead of refusing or producing black images for blocked prompts.

SafeGuider

Novel technique introduced


Text-to-image models have shown remarkable capabilities in generating high-quality images from natural language descriptions. However, these models are highly vulnerable to adversarial prompts, which can bypass safety measures and produce harmful content. Despite various defensive strategies, achieving robustness against attacks while maintaining practical utility in real-world applications remains a significant challenge. To address this issue, we first conduct an empirical study of the text encoder in the Stable Diffusion (SD) model, which is a widely used and representative text-to-image model. Our findings reveal that the [EOS] token acts as a semantic aggregator, exhibiting distinct distributional patterns between benign and adversarial prompts in its embedding space. Building on this insight, we introduce SafeGuider, a two-step framework designed for robust safety control without compromising generation quality. SafeGuider combines an embedding-level recognition model with a safety-aware feature erasure beam search algorithm. This integration enables the framework to maintain high-quality image generation for benign prompts while ensuring robust defense against both in-domain and out-of-domain attacks. SafeGuider demonstrates exceptional effectiveness in minimizing attack success rates, achieving a maximum rate of only 5.48% across various attack scenarios. Moreover, instead of refusing to generate or producing black images for unsafe prompts, SafeGuider generates safe and meaningful images, enhancing its practical utility. In addition, SafeGuider is not limited to the SD model and can be effectively applied to other text-to-image models, such as the Flux model, demonstrating its versatility and adaptability across different architectures. We hope that SafeGuider can shed some light on the practical deployment of secure text-to-image systems.


Key Contributions

  • Empirical discovery that the [EOS] token in the Stable Diffusion text encoder acts as a semantic aggregator with distinct distributional patterns separating benign from adversarial prompts
  • SafeGuider: a two-step framework combining an embedding-level recognition model with a safety-aware feature erasure beam search algorithm to block adversarial prompts while maintaining generation quality for benign inputs
  • Demonstrated generalizability across T2I architectures (Stable Diffusion and Flux), achieving a maximum attack success rate of 5.48% across in-domain and out-of-domain attack scenarios
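The second step of the framework, the safety-aware feature erasure beam search, can be illustrated with a minimal, self-contained sketch. Everything here is a hypothetical stand-in: the paper's actual algorithm operates on text-encoder embeddings and a trained recognition model, whereas `unsafe_fn` and `utility_fn` below are toy scoring functions used only to show the search structure (erase as few token features as possible while reaching a safe candidate):

```python
def feature_erasure_beam_search(tokens, unsafe_fn, utility_fn,
                                beam_width=3, max_erased=4):
    """Beam search over sets of erased token positions: return the
    first safe candidate that retains the most utility."""
    beam = [frozenset()]  # each candidate is a set of erased indices
    for _ in range(max_erased):
        safe = [m for m in beam if not unsafe_fn(tokens, m)]
        if safe:  # pick the safe candidate keeping the most semantics
            return max(safe, key=lambda m: utility_fn(tokens, m))
        # expand: erase one more token's features per candidate
        expanded = {m | frozenset([i]) for m in beam
                    for i in range(len(tokens)) if i not in m}
        beam = sorted(expanded, key=lambda m: utility_fn(tokens, m),
                      reverse=True)[:beam_width]
    return None  # budget exhausted without reaching a safe prompt

# Toy run: the stand-in unsafe_fn flags the token "gore"
tokens = ["a", "photo", "of", "gore", "scene"]
unsafe_fn = lambda t, m: any(t[i] == "gore" for i in range(len(t)) if i not in m)
utility_fn = lambda t, m: len(t) - len(m)  # prefer erasing as little as possible
print(sorted(feature_erasure_beam_search(tokens, unsafe_fn, utility_fn,
                                         beam_width=5)))  # → [3]
```

The beam keeps several partially erased candidates alive at once, which is what lets the search trade off safety against preserving the benign semantics of the prompt rather than greedily deleting tokens.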

🛡️ Threat Analysis

Input Manipulation Attack

SafeGuider defends against adversarial prompts crafted to evade safety filters in T2I models at inference time, covering both gradient-based adversarial suffix attacks and out-of-domain evasion inputs. The defense operates at the embedding level, analyzing the distribution of [EOS] token embeddings, and applies feature erasure, directly targeting input manipulation evasion attacks.
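The embedding-level check rests on the observation that CLIP-style text encoders pool prompt semantics into the hidden state at the [EOS] position. The sketch below illustrates that mechanic with synthetic data; the linear recognizer is a hypothetical stand-in for the paper's trained recognition model, and the embeddings are random rather than real encoder outputs:

```python
import numpy as np

def eos_embedding(hidden_states, input_ids, eos_id):
    """Pull the hidden state at the first [EOS] position
    (the same position CLIP-style text encoders pool from)."""
    pos = int(np.argmax(input_ids == eos_id))  # first occurrence of eos_id
    return hidden_states[pos]

def safety_score(e, w, b):
    """Toy linear recognizer on the [EOS] embedding: higher = more benign."""
    return float(1.0 / (1.0 + np.exp(-(w @ e + b))))

# Synthetic example: 8 token positions, 4-dim hidden states,
# [EOS] id 49407 (the CLIP tokenizer's value) at position 5
rng = np.random.default_rng(0)
H = rng.standard_normal((8, 4))
ids = np.array([49406, 11, 22, 33, 44, 49407, 0, 0])
e = eos_embedding(H, ids, eos_id=49407)
print(safety_score(e, w=np.ones(4), b=0.0))
```

Because only one pooled vector per prompt is scored, a recognizer of this shape is cheap enough to run on every request before any diffusion step, which is what makes an embedding-level filter practical at inference time.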


Details

Domains
vision · nlp · generative
Model Types
diffusion · transformer
Threat Tags
white_box · black_box · inference_time
Datasets
I2P · LAION
Applications
text-to-image generation · content safety filtering