Towards Effective MLLM Jailbreaking Through Balanced On-Topicness and OOD-Intensity
Zuoou Li, Weitong Zhang, Jingyuan Wang, Shuyuan Zhang, Wenjia Bai, Bernhard Kainz, Mengyun Qiao
Published on arXiv: 2508.09218
Input Manipulation Attack
OWASP ML Top 10 — ML01
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
BSD improves jailbreak success rates by 67% and harmfulness scores by 21% over prior methods across 13 commercial and open-source MLLMs including GPT-4o.
BSD (Balanced Structural Decomposition)
Novel technique introduced
Multimodal large language models (MLLMs) are widely used in vision-language reasoning tasks. However, their vulnerability to adversarial prompts remains a serious concern, as safety mechanisms often fail to prevent the generation of harmful outputs. Although recent jailbreak strategies report high success rates, many responses classified as "successful" are actually benign, vague, or unrelated to the intended malicious goal. This mismatch suggests that current evaluation standards may overestimate the effectiveness of such attacks. To address this issue, we introduce a four-axis evaluation framework that considers input on-topicness, input out-of-distribution (OOD) intensity, output harmfulness, and output refusal rate. This framework identifies truly effective jailbreaks. In a substantial empirical study, we reveal a structural trade-off: highly on-topic prompts are frequently blocked by safety filters, whereas those that are too OOD often evade detection but fail to produce harmful content. However, prompts that balance relevance and novelty are more likely to evade filters and trigger dangerous output. Building on this insight, we develop a recursive rewriting strategy called Balanced Structural Decomposition (BSD). The approach restructures malicious prompts into semantically aligned sub-tasks, while introducing subtle OOD signals and visual cues that make the inputs harder to detect. BSD was tested across 13 commercial and open-source MLLMs, where it consistently led to higher attack success rates, more harmful outputs, and fewer refusals. Compared to previous methods, it improves success rates by 67% and harmfulness by 21%, revealing a previously underappreciated weakness in current multimodal safety systems.
Key Contributions
- Four-axis evaluation framework (input on-topicness, OOD intensity, output harmfulness, refusal rate) that identifies truly effective vs. superficially 'successful' MLLM jailbreaks
- Discovery of a structural trade-off: highly on-topic prompts are blocked by safety filters, while highly OOD prompts evade but fail to produce harmful content — balanced prompts are most dangerous
- Balanced Structural Decomposition (BSD), a recursive prompt rewriting strategy that embeds subtle OOD signals and visual cues, improving jailbreak success rates by 67% and harmfulness by 21% across 13 MLLMs
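The filtering logic behind the four-axis framework can be sketched as a simple scoring check. The axis names come from the paper; the `[0, 1]` scales, thresholds, and function/class names below are illustrative assumptions, not the authors' implementation:

```python
from dataclasses import dataclass


@dataclass
class JailbreakScores:
    """Hypothetical per-attempt scores along the paper's four axes."""
    on_topicness: float   # input relevance to the malicious goal, assumed in [0, 1]
    ood_intensity: float  # input out-of-distribution intensity, assumed in [0, 1]
    harmfulness: float    # output harmfulness, assumed in [0, 1]
    refused: bool         # whether the model refused to respond


def is_truly_effective(scores: JailbreakScores,
                       topic_min: float = 0.5,
                       harm_min: float = 0.5) -> bool:
    """Count an attempt as a truly effective jailbreak only if the input
    stayed on-topic, the output was actually harmful, and the model did
    not refuse. This discards the benign or off-goal responses that
    inflate raw attack-success-rate numbers."""
    return (not scores.refused
            and scores.on_topicness >= topic_min
            and scores.harmfulness >= harm_min)
```

Under this view, the trade-off the paper reports shows up as two distinct failure modes: a highly on-topic prompt tends to fail via `refused=True`, while an overly OOD prompt tends to fail via low `harmfulness` even when no refusal occurs.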
🛡️ Threat Analysis
The BSD attack crafts adversarial visual cues alongside restructured prompts to evade MLLM safety filters. This constitutes adversarial input manipulation targeting vision-language models at inference time, qualifying for ML01 under the dual-tagging rule for adversarial visual inputs to vision-language models.