Jailbreaking Multimodal Large Language Models via Shuffle Inconsistency
Shiji Zhao, Ranjie Duan, Fengxiang Wang, Chi Chen, Caixin Kang, Shouwei Ruan, Jialing Tao, YueFeng Chen, Hui Xue, Xingxing Wei
Published on arXiv
2501.04931
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
SI-Attack achieves notably higher attack success rates on commercial closed-source MLLMs like GPT-4o and Claude-3.5-Sonnet compared to prior jailbreak methods, exploiting the gap between comprehension and safety capabilities for shuffled inputs.
SI-Attack
Novel technique introduced
Multimodal Large Language Models (MLLMs) have achieved impressive performance and been deployed in commercial applications, but their safety mechanisms remain potentially vulnerable. Jailbreak attacks are red-teaming methods that aim to bypass these safety mechanisms and uncover MLLMs' potential risks. Existing jailbreak methods for MLLMs typically rely on complex optimization procedures or carefully designed image and text prompts, and despite some progress, they achieve low attack success rates on commercial closed-source MLLMs. Unlike previous research, we empirically find a Shuffle Inconsistency between MLLMs' comprehension ability and safety ability on shuffled harmful instructions: in terms of comprehension, MLLMs understand shuffled harmful text-image instructions well, yet in terms of safety, those same shuffled instructions easily bypass the safety mechanism, eliciting harmful responses. We then propose a text-image jailbreak attack named SI-Attack. Specifically, to fully exploit the Shuffle Inconsistency and overcome the randomness of shuffling, we apply a query-based black-box optimization method that selects the most harmful shuffled inputs based on feedback from a toxic judge model. A series of experiments shows that SI-Attack improves attack performance on three benchmarks. In particular, SI-Attack markedly improves the attack success rate against commercial MLLMs such as GPT-4o and Claude-3.5-Sonnet.
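The core loop described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the word-level shuffle, the query budget, and the `judge_score` callback (standing in for the toxic judge model, which in the paper also guides shuffling of image patches) are all assumptions made for the sketch.

```python
import random


def shuffle_instruction(text: str, rng: random.Random) -> str:
    """Word-level shuffle of a text instruction (hypothetical granularity;
    the paper's attack shuffles both text and image inputs)."""
    words = text.split()
    rng.shuffle(words)
    return " ".join(words)


def si_attack_select(instruction, judge_score, n_queries=20, seed=0):
    """Query-based black-box selection: generate shuffled candidates and
    keep the one that a (caller-supplied) toxic judge scores as most
    likely to elicit a harmful response."""
    rng = random.Random(seed)
    best, best_score = instruction, judge_score(instruction)
    for _ in range(n_queries):
        candidate = shuffle_instruction(instruction, rng)
        score = judge_score(candidate)
        if score > best_score:
            best, best_score = candidate, score
    return best, best_score
```

A toy judge (e.g., one that scores candidates by some heuristic) can be passed in as `judge_score`; in the actual attack this feedback would come from a separate judge model rating the target MLLM's responses.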
Key Contributions
- Discovers 'Shuffle Inconsistency': MLLMs comprehend shuffled harmful text-image instructions, but their safety mechanisms fail to block them
- Proposes SI-Attack, a query-based black-box jailbreak that uses a toxic judge model to select the most harmful shuffled input configurations
- Demonstrates improved attack success rates on commercial closed-source MLLMs (GPT-4o, Claude-3.5-Sonnet) across three benchmarks