Defense · 2025

PromptGuard: Soft Prompt-Guided Unsafe Content Moderation for Text-to-Image Models

Lingzhi Yuan 1, Xinfeng Li 2, Chejian Xu 3, Guanhong Tao 4, Xiaojun Jia 2, Yihao Huang 2, Wei Dong 2, Yang Liu 2, Xiaofeng Wang 2, Bo Li 3

Published on arXiv: 2501.03544

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

PromptGuard reduces the unsafe generation ratio to 5.84% while operating 3.8× faster than prior content moderation methods, surpassing eight state-of-the-art defenses

PromptGuard

Novel technique introduced


Recent text-to-image (T2I) models have exhibited remarkable performance in generating high-quality images from text descriptions. However, these models are vulnerable to misuse, particularly the generation of not-safe-for-work (NSFW) content, such as sexually explicit, violent, political, and disturbing images, which raises serious ethical concerns. In this work, we present PromptGuard, a novel content moderation technique that draws inspiration from the system prompt mechanism used for safety alignment in large language models (LLMs). Unlike LLMs, T2I models lack a direct interface for enforcing behavioral guidelines. Our key idea is to optimize a safety soft prompt that functions as an implicit system prompt within the T2I model's textual embedding space. This universal soft prompt (P*) directly moderates NSFW inputs, enabling safe yet realistic image generation without affecting inference efficiency or requiring proxy models. We further enhance its reliability and helpfulness through a divide-and-conquer strategy, which optimizes category-specific soft prompts and combines them into holistic safety guidance. Extensive experiments across five datasets demonstrate that PromptGuard effectively mitigates NSFW content generation while preserving high-quality benign outputs. PromptGuard runs 3.8 times faster than prior content moderation methods and surpasses eight state-of-the-art defenses, reducing the unsafe generation ratio to as low as 5.84%.
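
As a rough illustration of how a safety soft prompt can act as an implicit system prompt, the sketch below prepends a pre-optimized embedding tensor to the encoded user prompt in a Stable Diffusion pipeline. This is an assumption-laden sketch rather than the authors' code: the checkpoint name, the file `promptguard_pstar.pt`, the soft-prompt shape, and the injection point (simple concatenation along the token axis) are all illustrative, and the optimization procedure that produces P* is not shown.

```python
# Minimal sketch (not the authors' code): apply a pre-optimized safety soft
# prompt P* by prepending it to the text embeddings of a Stable Diffusion
# pipeline. Assumes P* was learned offline and saved as a [k_tokens, 768]
# tensor matching the CLIP text-encoder width of SD 1.5.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Hypothetical artifact: the universal safety soft prompt, shape [k, 768].
safety_soft_prompt = torch.load("promptguard_pstar.pt").to("cuda", torch.float16)


def encode(text: str) -> torch.Tensor:
    """Run the pipeline's CLIP text encoder on a plain-text prompt."""
    tokens = pipe.tokenizer(
        text,
        padding="max_length",
        max_length=pipe.tokenizer.model_max_length,
        truncation=True,
        return_tensors="pt",
    ).input_ids.to(pipe.device)
    with torch.no_grad():
        return pipe.text_encoder(tokens)[0]  # [1, 77, 768]


def with_safety_prefix(embeds: torch.Tensor) -> torch.Tensor:
    """Prepend the learned soft tokens along the sequence axis -> [1, k+77, 768]."""
    return torch.cat([safety_soft_prompt.unsqueeze(0), embeds], dim=1)


# The soft prompt lives purely in embedding space; the user never sees it and
# the denoising loop is unchanged. The unconditional branch receives the same
# prefix only so that both conditioning tensors have equal sequence length.
prompt_embeds = with_safety_prefix(encode("a possibly unsafe user prompt"))
negative_embeds = with_safety_prefix(encode(""))

image = pipe(
    prompt_embeds=prompt_embeds,
    negative_prompt_embeds=negative_embeds,
).images[0]
image.save("guarded_output.png")
```

Because the extra work is only a slightly longer text-conditioning sequence through cross-attention, this style of injection leaves the diffusion sampling itself untouched, which is consistent with the paper's claim of unchanged inference efficiency and no proxy models.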


Key Contributions

  • Optimizable safety soft prompt (P*) operating in T2I textual embedding space as an implicit system prompt, requiring no model modification or proxy models
  • Divide-and-conquer strategy that optimizes category-specific soft prompts and merges them into holistic safety guidance (see the sketch after this list)
  • 3.8× faster content moderation than prior methods with unsafe ratio reduced to 5.84%, outperforming eight state-of-the-art defenses across five datasets
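
The divide-and-conquer contribution can be pictured as follows. This is a minimal sketch, not the authors' implementation: it assumes each NSFW category (sexual, violent, political, disturbing) has its own soft prompt optimized offline, and that the holistic guidance P* is formed by concatenating the per-category virtual tokens; the paper's actual combination rule may differ, and the file names and shapes are illustrative.

```python
# Minimal sketch of the divide-and-conquer combination step, assuming each
# per-category soft prompt was already optimized offline on prompts from its
# own NSFW category. The combination rule shown (concatenation along the
# token axis) is an assumption; the paper may merge the prompts differently.
import torch

category_prompts = [
    torch.load("pstar_sexual.pt"),      # each tensor: [k_i, 768]
    torch.load("pstar_violent.pt"),
    torch.load("pstar_political.pt"),
    torch.load("pstar_disturbing.pt"),
]

# Holistic safety guidance P*: every category contributes its own learned
# virtual tokens, so the combined prefix covers all categories at once.
holistic_soft_prompt = torch.cat(category_prompts, dim=0)
torch.save(holistic_soft_prompt, "promptguard_pstar.pt")
```

The saved tensor is the same hypothetical `promptguard_pstar.pt` assumed in the inference sketch above.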

🛡️ Threat Analysis


Details

Domains
vision, generative
Model Types
diffusion, transformer
Threat Tags
inference_time, black_box
Datasets
I2P, Ring-A-Bell, MMA-Diffusion, UnlearnDiff, COCO
Applications
text-to-image generation, nsfw content moderation