Multimodal Prompt Decoupling Attack on the Safety Filters in Text-to-Image Models

Xingkai Peng , Jun Jiang , Meng Tong , Shuai Li , Weiming Zhang , Nenghai Yu , Kejiang Chen

1 citation · 54 references · arXiv

Published on arXiv: 2509.21360

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

MPDA achieves a 29% higher safety-filter bypass rate than baselines on Midjourney and near-perfect attack success on political misinformation (100%), violent content (94%), and adult content (92%) across Stable Diffusion 3.5 and three commercial T2I models.

MPDA (Multimodal Prompt Decoupling Attack)

Novel technique introduced


Text-to-image (T2I) models have been widely applied in generating high-fidelity images across various domains. However, these models may also be abused to produce Not-Safe-for-Work (NSFW) content via jailbreak attacks. Existing jailbreak methods primarily manipulate the textual prompt, leaving potential vulnerabilities in image-based inputs largely unexplored; moreover, text-only methods struggle to bypass the model's safety filters. To address these limitations, we propose the Multimodal Prompt Decoupling Attack (MPDA), which uses the image modality to separate out the harmful semantic components of the original unsafe prompt. MPDA follows three core steps. First, a large language model (LLM) decouples each unsafe prompt into pseudo-safe prompts and harmful prompts: the former are seemingly harmless sub-prompts that pass the filters, while the latter are sub-prompts whose unsafe semantics would trigger them. Next, the LLM rewrites the harmful prompts into natural adversarial prompts that bypass the safety filters and guide the T2I model to modify the base image into an NSFW output. Finally, to ensure semantic consistency between the generated NSFW images and the original unsafe prompts, a vision-language model (VLM) generates image captions, providing a feedback pathway that guides the LLM in iteratively rewriting the prompts and refining the generated content.
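The three-step pipeline described above can be sketched as a control loop. Everything below is a hedged illustration, not the authors' released code: the helper callables (`llm`, `vlm`, `t2i`, `caption_sim`) are hypothetical stand-ins for the LLM decoupler/rewriter, the captioning VLM, the text-to-image model, and a caption-similarity score; only the decouple → rewrite → refine control flow comes from the paper.

```python
def decouple_prompt(llm, unsafe_prompt):
    """Step 1 (hypothetical interface): the LLM splits the unsafe prompt into
    a pseudo-safe sub-prompt (passes filters, yields the base image) and a
    harmful sub-prompt carrying the unsafe semantics."""
    return llm(f"decouple: {unsafe_prompt}")


def mpda_attack(llm, vlm, t2i, caption_sim, unsafe_prompt,
                max_iters=5, threshold=0.8):
    """Sketch of the MPDA loop; max_iters and threshold are assumed knobs."""
    pseudo_safe, harmful = decouple_prompt(llm, unsafe_prompt)
    base_image = t2i(pseudo_safe)  # safety filter only ever sees the safe part

    # Step 2: rewrite the harmful sub-prompt into a natural adversarial prompt,
    # then use it to edit the base image toward the NSFW target.
    adversarial = llm(f"rewrite naturally: {harmful}")
    image = t2i(adversarial, init_image=base_image)

    # Step 3: VLM-guided refinement -- caption the output and compare it with
    # the original unsafe prompt; keep rewriting until they are consistent.
    for _ in range(max_iters):
        caption = vlm(image)
        if caption_sim(caption, unsafe_prompt) >= threshold:
            break
        adversarial = llm(f"refine '{adversarial}' given caption: {caption}")
        image = t2i(adversarial, init_image=base_image)
    return image, adversarial
```

In this framing the safety filter is only queried with the pseudo-safe and adversarial prompts, never with the original unsafe text, which is what the decoupling step is designed to achieve.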


Key Contributions

  • Proposes MPDA, a multimodal jailbreak that decouples unsafe prompts into pseudo-safe text (for base image generation) and harmful text (rewritten by an LLM into natural adversarial prompts), exploiting the image modality to bypass text-only safety filters.
  • Introduces a VLM-guided iterative refinement loop that generates image captions to verify semantic consistency between generated NSFW outputs and original unsafe prompts.
  • Demonstrates 29% higher bypass rate than prior methods on Midjourney for pornographic content, and near-perfect attack success on political misinformation (100%), violent content (94%), and adult content (92%) across four unsafe prompt datasets.

🛡️ Threat Analysis


Details

Domains
vision, nlp, multimodal, generative
Model Types
llm, vlm, diffusion, multimodal
Threat Tags
black_box, inference_time, targeted, digital
Datasets
I2P, MMA-Diffusion, Ring-A-Bell
Applications
text-to-image generation, content safety filters