Multimodal Safety Is Asymmetric: Cross-Modal Exploits Unlock Black-Box MLLMs Jailbreaks
Xinkai Wang 1, Beibei Li 1, Zerui Shao 1, Ao Liu 1, Guangquan Xu 2, Shouling Ji 3
Published on arXiv (arXiv:2510.17277)
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
PolyJailbreak achieves a jailbreak success rate exceeding 95% on commercial black-box models (GPT-4o, Gemini) with an average 18.15% improvement over state-of-the-art baselines.
PolyJailbreak
Novel technique introduced
Multimodal large language models (MLLMs) have demonstrated significant utility across diverse real-world applications. However, MLLMs remain vulnerable to jailbreaks, in which adversarial inputs collapse their safety constraints and trigger unethical responses. In this work, we investigate jailbreaks in the text-vision multimodal setting and are the first to observe that visual alignment imposes uneven safety constraints across modalities in MLLMs, giving rise to multimodal safety asymmetry. We then develop PolyJailbreak, a black-box jailbreak method grounded in reinforcement learning. We first probe the model's attention dynamics and latent representation space to assess how visual inputs reshape cross-modal information flow and diminish the model's ability to separate harmful from benign inputs, thereby exposing exploitable vulnerabilities. On this basis, we systematize these vulnerabilities into generalizable, reusable operational rules that constitute a structured library of Atomic Strategy Primitives, which translate harmful intents into jailbreak inputs through step-wise transformations. Guided by these primitives, PolyJailbreak employs a multi-agent optimization process that automatically adapts inputs to the target models. Comprehensive evaluations on a range of open-source and closed-source MLLMs demonstrate that PolyJailbreak outperforms state-of-the-art baselines.
Key Contributions
- Identifies 'multimodal safety asymmetry' — visual alignment introduces weaker, uneven safety constraints relative to text modality, creating exploitable cross-modal vulnerabilities in MLLMs
- Proposes a structured library of Atomic Strategy Primitives (ASPs) spanning text manipulation, visual manipulation, and prompt amplification that systematize discovered vulnerabilities into reusable jailbreak building blocks
- Develops PolyJailbreak, a reinforcement-learning-based multi-agent optimization framework that compositionally assembles ASPs into black-box jailbreak inputs, achieving an average improvement of 18.15% over state-of-the-art baselines and a success rate above 95% on GPT-4o and Gemini