Red-teaming the Multimodal Reasoning: Jailbreaking Vision-Language Models via Cross-modal Entanglement Attacks
Yu Yan 1,2, Sheng Sun 1, Shengjia Cheng 3, Teli Liu 3, Mingfeng Li 3, Min Liu 1,2
1 Institute of Computing Technology, Chinese Academy of Sciences
Published on arXiv (arXiv:2602.10148)
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
CrossTALK achieves a state-of-the-art attack success rate against advanced VLMs' safety alignment under black-box conditions
CrossTALK (also referred to as COMET)
Novel technique introduced
Vision-Language Models (VLMs) with multimodal reasoning capabilities are high-value attack targets, given their potential for handling complex multimodal harmful tasks. Mainstream black-box jailbreak attacks on VLMs work by distributing malicious clues across modalities to disperse model attention and bypass safety alignment mechanisms. However, these adversarial attacks rely on simple, fixed image-text combinations whose attack complexity does not scale, limiting their effectiveness for red-teaming VLMs' continuously evolving reasoning capabilities. We propose CrossTALK (Cross-modal enTAngLement attacK), a scalable approach that extends and entangles information clues across modalities to exceed the safety alignment patterns VLMs have been trained on and generalized to. Specifically, knowledge-scalable reframing extends harmful tasks into multi-hop chain instructions, cross-modal clue entangling migrates visualizable entities into images to build multimodal reasoning links, and cross-modal scenario nesting uses multimodal contextual instructions to steer VLMs toward detailed harmful outputs. Experiments show CrossTALK achieves a state-of-the-art attack success rate.
Key Contributions
- CrossTALK: a scalable cross-modal jailbreak framework that entangles harmful intent across text and image modalities to exceed VLM safety alignment patterns
- Knowledge-scalable reframing that expands harmful tasks into multi-hop chain-of-thought instructions to increase attack complexity
- Cross-modal clue entangling, which migrates visualizable entities into images to build multimodal reasoning links, and cross-modal scenario nesting, which uses multimodal contextual instructions to steer VLMs toward detailed harmful outputs