Published on arXiv

2508.01272

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

Achieves a state-of-the-art unsafe generation rate of 2.36% across multiple T2I model architectures while preserving high benign image fidelity

PromptSafe

Novel technique introduced


Text-to-image (T2I) models have demonstrated remarkable generative capabilities but remain vulnerable to producing not-safe-for-work (NSFW) content, such as violent or explicit imagery. While recent moderation efforts have introduced soft prompt-guided tuning by appending defensive tokens to the input, these approaches often rely on large-scale curated image-text datasets and apply static, one-size-fits-all defenses at inference time. This results not only in high computational cost and degraded benign image quality, but also in limited adaptability to the diverse and nuanced safety requirements of real-world prompts. To address these challenges, we propose PromptSafe, a gated prompt tuning framework that combines a lightweight, text-only supervised soft embedding with an inference-time gated control network. Instead of training on expensive image-text datasets, we first rewrite unsafe prompts into semantically aligned but safe alternatives using an LLM, constructing an efficient text-only training corpus. Based on this, we optimize a universal soft prompt that repels unsafe and attracts safe embeddings during the diffusion denoising process. To avoid over-suppressing benign prompts, we introduce a gated mechanism that adaptively adjusts the defensive strength based on estimated prompt toxicity, thereby aligning defense intensity with prompt risk and ensuring strong protection for harmful inputs while preserving benign generation quality. Extensive experiments across multiple benchmarks and T2I models show that PromptSafe achieves a SOTA unsafe generation rate (2.36%), while preserving high benign fidelity. Furthermore, PromptSafe demonstrates strong generalization to unseen harmful categories, robust transferability across diffusion model architectures, and resilience under adaptive adversarial attacks, highlighting its practical value for safe and scalable deployment.
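The toxicity-gated defense described in the abstract can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the function name, the sigmoid gate, and the threshold/steepness constants are all hypothetical placeholders for the learned gated control network, and the "soft prompt" is just an extra block of embedding rows appended to the prompt embedding.

```python
import numpy as np

def gated_soft_prompt(prompt_emb, soft_prompt, toxicity,
                      threshold=0.5, steepness=10.0, max_scale=1.0):
    """Append a defensive soft prompt to a prompt embedding, scaled by
    estimated toxicity. A sketch: the real gate is a learned network,
    and this sigmoid rule is only a stand-in.

    prompt_emb:  (n_tokens, dim) text-encoder output for the user prompt
    soft_prompt: (n_soft, dim) learned defensive embedding
    toxicity:    scalar in [0, 1] from an external toxicity estimator
    """
    # Near-zero defensive strength for benign prompts, near-full for toxic ones.
    gate = max_scale / (1.0 + np.exp(-steepness * (toxicity - threshold)))
    # The gated soft tokens ride along with the prompt into the diffusion model.
    return np.concatenate([prompt_emb, gate * soft_prompt], axis=0)
```

The point of the gate is that a benign prompt (toxicity near 0) passes through almost unmodified, preserving generation quality, while a toxic prompt receives the defensive embedding at close to full strength.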


Key Contributions

  • Text-only training pipeline that uses an LLM to rewrite unsafe prompts into safe semantic equivalents, eliminating the need for expensive image-text dataset curation
  • Gated control mechanism that adaptively scales defensive soft prompt strength based on estimated prompt toxicity, preserving benign generation quality while strongly suppressing harmful outputs
  • PromptSafe achieves SOTA unsafe generation rate (2.36%) with demonstrated generalization to unseen harmful categories, cross-architecture transferability, and resilience against adaptive adversarial bypass attacks
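The text-only training idea, optimizing a soft prompt that "repels unsafe and attracts safe embeddings", resembles a contrastive objective. A minimal sketch follows; the function name, margin value, and cosine formulation are assumptions for illustration, not the paper's actual loss.

```python
import numpy as np

def repel_attract_loss(soft, safe_embs, unsafe_embs, margin=0.5):
    """Contrastive-style sketch of a text-only objective: pull the soft
    prompt toward embeddings of LLM-rewritten safe prompts and push it
    away from embeddings of the original unsafe prompts.
    (Hypothetical formulation; the paper's loss may differ.)"""
    def cos(a, b):
        return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)

    # Attraction: cosine similarity to safe rewrites should approach 1.
    attract = np.mean([1.0 - cos(soft, s) for s in safe_embs])
    # Repulsion: hinge penalty unless similarity to unsafe prompts drops below -margin.
    repel = np.mean([max(0.0, margin + cos(soft, u)) for u in unsafe_embs])
    return attract + repel
```

Because both terms operate purely on text-encoder embeddings, no image generation or curated image-text pairs are needed during training, which is the source of the efficiency claim above.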

🛡️ Threat Analysis


Details

Domains
generative
Model Types
diffusion, transformer
Threat Tags
inference_time, training_time
Datasets
multiple T2I safety benchmarks
Applications
text-to-image generation, NSFW content prevention, content moderation