Multimodal Safety Is Asymmetric: Cross-Modal Exploits Unlock Black-Box MLLMs Jailbreaks

Xinkai Wang 1, Beibei Li 1, Zerui Shao 1, Ao Liu 1, Guangquan Xu 2, Shouling Ji 3

1 citation · 44 references · arXiv

Published on arXiv (2510.17277)

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

PolyJailbreak achieves a jailbreak success rate exceeding 95% on commercial black-box models (GPT-4o, Gemini) with an average 18.15% improvement over state-of-the-art baselines.

PolyJailbreak

Novel technique introduced


Multimodal large language models (MLLMs) have demonstrated significant utility across diverse real-world applications. However, MLLMs remain vulnerable to jailbreaks, where adversarial inputs can collapse their safety constraints and trigger unethical responses. In this work, we investigate jailbreaks in the text-vision multimodal setting and are the first to observe that visual alignment imposes uneven safety constraints across modalities in MLLMs, giving rise to multimodal safety asymmetry. We then develop PolyJailbreak, a black-box jailbreak method grounded in reinforcement learning. Initially, we probe the model's attention dynamics and latent representation space, assessing how visual inputs reshape cross-modal information flow and diminish the model's ability to separate harmful from benign inputs, thereby exposing exploitable vulnerabilities. On this basis, we systematize these vulnerabilities into generalizable and reusable operational rules that constitute a structured library of Atomic Strategy Primitives, which translate harmful intents into jailbreak inputs through step-wise transformations. Guided by the primitives, PolyJailbreak employs a multi-agent optimization process that automatically adapts inputs against the target models. We conduct comprehensive evaluations on a variety of open-source and closed-source MLLMs, demonstrating that PolyJailbreak outperforms state-of-the-art baselines.


Key Contributions

  • Identifies 'multimodal safety asymmetry' — visual alignment introduces weaker, uneven safety constraints relative to text modality, creating exploitable cross-modal vulnerabilities in MLLMs
  • Proposes a structured library of Atomic Strategy Primitives (ASPs) spanning text manipulation, visual manipulation, and prompt amplification that systematize discovered vulnerabilities into reusable jailbreak building blocks
  • Develops PolyJailbreak, a reinforcement learning-based multi-agent optimization framework that compositionally assembles ASPs into black-box jailbreak inputs, achieving 18.15% average improvement over SOTA and >95% success rate on GPT-4o and Gemini

🛡️ Threat Analysis


Details

Domains
nlp, multimodal
Model Types
llm, vlm, multimodal
Threat Tags
black_box, inference_time, targeted
Datasets
AdvBench
Applications
multimodal llm safety, vision-language model safety alignment