Sequential Comics for Jailbreaking Multimodal Large Language Models via Structured Visual Storytelling
Deyue Zhang, Dongdong Yang, Junjie Mu, Quanchen Zou, Zonghao Ying, Wenzhuo Xu, Zhao Liu, Xuan Wang, Xiangzheng Zhang
Published on arXiv
2510.15068
Input Manipulation Attack
OWASP ML Top 10 — ML01
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
Achieves 83.5% average attack success rate across 11 state-of-the-art MLLMs including GPT-5, Claude 4 Sonnet, and Gemini 2.5 Pro, surpassing prior SOTA visual jailbreak methods by 46%.
Sequential Comic Jailbreak (SCJ)
Novel technique introduced
Multimodal large language models (MLLMs) exhibit remarkable capabilities but remain susceptible to jailbreak attacks exploiting cross-modal vulnerabilities. In this work, we introduce a novel method that leverages sequential comic-style visual narratives to circumvent safety alignments in state-of-the-art MLLMs. Our method decomposes malicious queries into visually innocuous storytelling elements using an auxiliary LLM, generates corresponding image sequences through diffusion models, and exploits the models' reliance on narrative coherence to elicit harmful outputs. Extensive experiments on harmful textual queries from established safety benchmarks show that our approach achieves an average attack success rate of 83.5%, surpassing prior state-of-the-art by 46%. Compared with existing visual jailbreak methods, our sequential narrative strategy demonstrates superior effectiveness across diverse categories of harmful content. We further analyze attack patterns, uncover key vulnerability factors in multimodal safety mechanisms, and evaluate the limitations of current defense strategies against narrative-driven attacks, revealing significant gaps in existing protections.
Key Contributions
- Sequential Comic Jailbreak (SCJ): first attack to exploit narrative comprehension in MLLMs by decomposing malicious queries across diffusion-generated comic panels that are individually innocuous
- Demonstrates that safety alignment asymmetry between visual and textual modalities can be exploited via sequential storytelling, achieving 83.5% average ASR across 11 state-of-the-art MLLMs
- Evaluates failure modes of existing defenses (Llama Guard, LLaVA Guard) against narrative-driven attacks, exposing critical gaps in multimodal safety systems
🛡️ Threat Analysis
Strategically crafted visual inputs (diffusion-generated comic panels) manipulate MLLM outputs at inference time by exploiting cross-modal vulnerabilities — an adversarial content-manipulation attack against a VLM-integrated system.