Multimodal Safety Is Asymmetric: Cross-Modal Exploits Unlock Black-Box MLLMs Jailbreaks
Xinkai Wang 1, Beibei Li 1, Zerui Shao 1, Ao Liu 1, Guangquan Xu 2, Shouling Ji 3
Published on arXiv (arXiv:2510.17277)
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
PolyJailbreak achieves a jailbreak success rate exceeding 95% on commercial black-box models (GPT-4o, Gemini) with an average 18.15% improvement over state-of-the-art baselines.
PolyJailbreak
Novel technique introduced
Multimodal large language models (MLLMs) have demonstrated significant utility across diverse real-world applications. However, MLLMs remain vulnerable to jailbreaks, in which adversarial inputs collapse their safety constraints and trigger unethical responses. In this work, we investigate jailbreaks in the text-vision multimodal setting and are the first to observe that visual alignment imposes uneven safety constraints across modalities in MLLMs, giving rise to multimodal safety asymmetry. We then develop PolyJailbreak, a black-box jailbreak method grounded in reinforcement learning. We first probe the model's attention dynamics and latent representation space to assess how visual inputs reshape cross-modal information flow and diminish the model's ability to separate harmful from benign inputs, thereby exposing exploitable vulnerabilities. On this basis, we systematize these vulnerabilities into generalizable, reusable operational rules that constitute a structured library of Atomic Strategy Primitives, which translate harmful intents into jailbreak inputs through step-wise transformations. Guided by these primitives, PolyJailbreak employs a multi-agent optimization process that automatically adapts inputs to the target models. Comprehensive evaluations on a range of open-source and closed-source MLLMs demonstrate that PolyJailbreak outperforms state-of-the-art baselines.
Key Contributions
- Identifies 'multimodal safety asymmetry' — visual alignment introduces weaker, uneven safety constraints relative to text modality, creating exploitable cross-modal vulnerabilities in MLLMs
- Proposes a structured library of Atomic Strategy Primitives (ASPs) spanning text manipulation, visual manipulation, and prompt amplification that systematize discovered vulnerabilities into reusable jailbreak building blocks
- Develops PolyJailbreak, a reinforcement-learning-based multi-agent optimization framework that compositionally assembles ASPs into black-box jailbreak inputs, achieving an average improvement of 18.15% over state-of-the-art baselines and a success rate above 95% on GPT-4o and Gemini