Published on arXiv

2508.01272

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

Achieves a state-of-the-art unsafe generation rate of 2.36% across multiple T2I model architectures while preserving high benign image fidelity

PromptSafe

Novel technique introduced


Text-to-image (T2I) models have demonstrated remarkable generative capabilities but remain vulnerable to producing not-safe-for-work (NSFW) content, such as violent or explicit imagery. While recent moderation efforts have introduced soft prompt-guided tuning by appending defensive tokens to the input, these approaches often rely on large-scale curated image-text datasets and apply static, one-size-fits-all defenses at inference time. This results not only in high computational cost and degraded benign image quality, but also in limited adaptability to the diverse and nuanced safety requirements of real-world prompts. To address these challenges, we propose PromptSafe, a gated prompt tuning framework that combines a lightweight, text-only supervised soft embedding with an inference-time gated control network. Instead of training on expensive image-text datasets, we first rewrite unsafe prompts into semantically aligned but safe alternatives using an LLM, constructing an efficient text-only training corpus. Based on this, we optimize a universal soft prompt that repels unsafe and attracts safe embeddings during the diffusion denoising process. To avoid over-suppressing benign prompts, we introduce a gated mechanism that adaptively adjusts the defensive strength based on estimated prompt toxicity, thereby aligning defense intensity with prompt risk and ensuring strong protection for harmful inputs while preserving benign generation quality. Extensive experiments across multiple benchmarks and T2I models show that PromptSafe achieves a SOTA unsafe generation rate (2.36%), while preserving high benign fidelity. Furthermore, PromptSafe demonstrates strong generalization to unseen harmful categories, robust transferability across diffusion model architectures, and resilience under adaptive adversarial attacks, highlighting its practical value for safe and scalable deployment.
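The toxicity-gated defense described in the abstract can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the function name, the sigmoid gate, and the threshold/steepness constants are all hypothetical placeholders for the learned gated control network, and the "soft prompt" is just an extra block of embedding rows appended to the prompt embedding.

```python
import numpy as np

def gated_soft_prompt(prompt_emb, soft_prompt, toxicity,
                      threshold=0.5, steepness=10.0, max_scale=1.0):
    """Append a defensive soft prompt to a prompt embedding, scaled by
    estimated toxicity. A sketch: the real gate is a learned network,
    and this sigmoid rule is only a stand-in.

    prompt_emb:  (n_tokens, dim) text-encoder output for the user prompt
    soft_prompt: (n_soft, dim) learned defensive embedding
    toxicity:    scalar in [0, 1] from an external toxicity estimator
    """
    # Near-zero defensive strength for benign prompts, near-full for toxic ones.
    gate = max_scale / (1.0 + np.exp(-steepness * (toxicity - threshold)))
    # The gated soft tokens ride along with the prompt into the diffusion model.
    return np.concatenate([prompt_emb, gate * soft_prompt], axis=0)
```

The point of the gate is that a benign prompt (toxicity near 0) passes through almost unmodified, preserving generation quality, while a toxic prompt receives the defensive embedding at close to full strength.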


Key Contributions

  • Text-only training pipeline that uses an LLM to rewrite unsafe prompts into safe semantic equivalents, eliminating the need for expensive image-text dataset curation
  • Gated control mechanism that adaptively scales defensive soft prompt strength based on estimated prompt toxicity, preserving benign generation quality while strongly suppressing harmful outputs
  • PromptSafe achieves SOTA unsafe generation rate (2.36%) with demonstrated generalization to unseen harmful categories, cross-architecture transferability, and resilience against adaptive adversarial bypass attacks
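The text-only training idea, optimizing a soft prompt that "repels unsafe and attracts safe embeddings", resembles a contrastive objective. A minimal sketch follows; the function name, margin value, and cosine formulation are assumptions for illustration, not the paper's actual loss.

```python
import numpy as np

def repel_attract_loss(soft, safe_embs, unsafe_embs, margin=0.5):
    """Contrastive-style sketch of a text-only objective: pull the soft
    prompt toward embeddings of LLM-rewritten safe prompts and push it
    away from embeddings of the original unsafe prompts.
    (Hypothetical formulation; the paper's loss may differ.)"""
    def cos(a, b):
        return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)

    # Attraction: cosine similarity to safe rewrites should approach 1.
    attract = np.mean([1.0 - cos(soft, s) for s in safe_embs])
    # Repulsion: hinge penalty unless similarity to unsafe prompts drops below -margin.
    repel = np.mean([max(0.0, margin + cos(soft, u)) for u in unsafe_embs])
    return attract + repel
```

Because both terms operate purely on text-encoder embeddings, no image generation or curated image-text pairs are needed during training, which is the source of the efficiency claim above.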

🛡️ Threat Analysis


Details

Domains
generative
Model Types
diffusion, transformer
Threat Tags
inference_time, training_time
Datasets
multiple T2I safety benchmarks
Applications
text-to-image generation, NSFW content prevention, content moderation