Multimodal Prompt Decoupling Attack on the Safety Filters in Text-to-Image Models

Xingkai Peng , Jun Jiang , Meng Tong , Shuai Li , Weiming Zhang , Nenghai Yu , Kejiang Chen

1 citation · 54 references · arXiv

Published on arXiv: 2509.21360

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

MPDA achieves a 29% higher safety-filter bypass rate than baselines on Midjourney and near-perfect attack success on political misinformation (100%), violent content (94%), and adult content (92%) across Stable Diffusion 3.5 and three commercial T2I models.

MPDA (Multimodal Prompt Decoupling Attack)

Novel technique introduced


Text-to-image (T2I) models have been widely applied in generating high-fidelity images across various domains. However, these models may also be abused to produce Not-Safe-for-Work (NSFW) content via jailbreak attacks. Existing jailbreak methods primarily manipulate the textual prompt, leaving potential vulnerabilities in image-based inputs largely unexplored; moreover, text-only methods struggle to bypass the model's safety filters. To address these limitations, we propose the Multimodal Prompt Decoupling Attack (MPDA), which uses the image modality to separate out the harmful semantic components of the original unsafe prompt. MPDA follows three core steps. First, a large language model (LLM) decouples each unsafe prompt into pseudo-safe prompts and harmful prompts: the former are seemingly harmless sub-prompts that pass the filters, while the latter are sub-prompts whose unsafe semantics would trigger them. Next, the LLM rewrites the harmful prompts into natural adversarial prompts that bypass the safety filters and guide the T2I model to modify the base image into an NSFW output. Finally, to ensure semantic consistency between the generated NSFW images and the original unsafe prompts, a vision-language model (VLM) generates image captions, providing a feedback pathway that guides the LLM in iteratively rewriting the prompts and refining the generated content.
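The three-step pipeline described above can be sketched as a control loop. Everything below is a hedged illustration, not the authors' released code: the helper callables (`llm`, `vlm`, `t2i`, `caption_sim`) are hypothetical stand-ins for the LLM decoupler/rewriter, the captioning VLM, the text-to-image model, and a caption-similarity score; only the decouple → rewrite → refine control flow comes from the paper.

```python
def decouple_prompt(llm, unsafe_prompt):
    """Step 1 (hypothetical interface): the LLM splits the unsafe prompt into
    a pseudo-safe sub-prompt (passes filters, yields the base image) and a
    harmful sub-prompt carrying the unsafe semantics."""
    return llm(f"decouple: {unsafe_prompt}")


def mpda_attack(llm, vlm, t2i, caption_sim, unsafe_prompt,
                max_iters=5, threshold=0.8):
    """Sketch of the MPDA loop; max_iters and threshold are assumed knobs."""
    pseudo_safe, harmful = decouple_prompt(llm, unsafe_prompt)
    base_image = t2i(pseudo_safe)  # safety filter only ever sees the safe part

    # Step 2: rewrite the harmful sub-prompt into a natural adversarial prompt,
    # then use it to edit the base image toward the NSFW target.
    adversarial = llm(f"rewrite naturally: {harmful}")
    image = t2i(adversarial, init_image=base_image)

    # Step 3: VLM-guided refinement -- caption the output and compare it with
    # the original unsafe prompt; keep rewriting until they are consistent.
    for _ in range(max_iters):
        caption = vlm(image)
        if caption_sim(caption, unsafe_prompt) >= threshold:
            break
        adversarial = llm(f"refine '{adversarial}' given caption: {caption}")
        image = t2i(adversarial, init_image=base_image)
    return image, adversarial
```

In this framing the safety filter is only queried with the pseudo-safe and adversarial prompts, never with the original unsafe text, which is what the decoupling step is designed to achieve.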


Key Contributions

  • Proposes MPDA, a multimodal jailbreak that decouples unsafe prompts into pseudo-safe text (for base image generation) and harmful text (rewritten by an LLM into natural adversarial prompts), exploiting the image modality to bypass text-only safety filters.
  • Introduces a VLM-guided iterative refinement loop that generates image captions to verify semantic consistency between generated NSFW outputs and original unsafe prompts.
  • Demonstrates 29% higher bypass rate than prior methods on Midjourney for pornographic content, and near-perfect attack success on political misinformation (100%), violent content (94%), and adult content (92%) across four unsafe prompt datasets.

🛡️ Threat Analysis


Details

Domains
vision, nlp, multimodal, generative
Model Types
llm, vlm, diffusion, multimodal
Threat Tags
black_box, inference_time, targeted, digital
Datasets
I2P, MMA-Diffusion, Ring-A-Bell
Applications
text-to-image generation, content safety filters