
Mosaic: Multimodal Jailbreak against Closed-Source VLMs via Multi-View Ensemble Optimization

Yuqin Lan 1, Gen Li 1, Yuanze Hu 1, Weihao Shen 1, Zhaoxin Fan 1, Faguo Wu 1, Xiao Zhang 1, Laurence T. Yang 2, Zhiming Zheng 1


Published on arXiv

2604.09253

Input Manipulation Attack

OWASP ML Top 10 — ML01

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

Achieves state-of-the-art Attack Success Rate and Average Toxicity against commercial closed-source VLMs by alleviating surrogate dependency through multi-view ensemble optimization

Mosaic

Novel technique introduced


Vision-Language Models (VLMs) are powerful but remain vulnerable to multimodal jailbreak attacks. Existing attacks rely mainly on either explicit visual prompt attacks or gradient-based adversarial optimization. The former is easier to detect; the latter produces subtle, less perceptible perturbations but is usually optimized and evaluated under homogeneous open-source surrogate-target settings, leaving its effectiveness against commercial closed-source VLMs in heterogeneous settings unclear. To examine this issue, we study different surrogate-target settings and observe a consistent gap between homogeneous and heterogeneous settings, a phenomenon we term surrogate dependency. Motivated by this finding, we propose Mosaic, a Multi-view ensemble optimization framework for multimodal jailbreak against closed-source VLMs, which alleviates surrogate dependency under heterogeneous surrogate-target settings by reducing over-reliance on any single surrogate model and visual view. Specifically, Mosaic incorporates three core components: a Text-Side Transformation module, which perturbs refusal-sensitive lexical patterns; a Multi-View Image Optimization module, which updates perturbations under diverse cropped views to avoid overfitting to a single visual view; and a Surrogate Ensemble Guidance module, which aggregates optimization signals from multiple surrogate VLMs to reduce surrogate-specific bias. Extensive experiments on safety benchmarks demonstrate that Mosaic achieves state-of-the-art Attack Success Rate and Average Toxicity against commercial closed-source VLMs.
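The multi-view ensemble loop described in the abstract can be sketched as follows. This is a toy illustration, not the paper's implementation: the "surrogates" here are simple quadratic losses standing in for gradients from open-source surrogate VLMs, the image is a flat list of floats, and the function names, step size, and epsilon are all illustrative assumptions.

```python
import random

def toy_loss_grad(weights, view):
    # Gradient of a quadratic toy loss sum(w_i * x_i^2) over a cropped view.
    # In Mosaic this would be a jailbreak-objective gradient from a surrogate VLM.
    return [2.0 * w * x for w, x in zip(weights, view)]

def mosaic_step(image, surrogates, num_views=4, view_len=6,
                step=0.05, eps=0.3, rng=None):
    """One hypothetical Mosaic-style update: accumulate gradients over
    several random crops (Multi-View Image Optimization) and over several
    surrogate models (Surrogate Ensemble Guidance), then take a signed
    step clipped to an L_inf ball of radius eps."""
    rng = rng or random.Random(0)
    grad = [0.0] * len(image)
    for _ in range(num_views):                       # diverse cropped views
        s = rng.randrange(len(image) - view_len + 1)
        view = image[s:s + view_len]
        for weights in surrogates:                   # ensemble of surrogates
            for i, gi in enumerate(toy_loss_grad(weights[s:s + view_len], view)):
                grad[s + i] += gi
    sign = lambda g: 1 if g > 0 else -1 if g < 0 else 0
    return [max(-eps, min(eps, x + step * sign(g)))
            for x, g in zip(image, grad)]
```

Averaging the update signal across crops and surrogates is what reduces over-reliance on any single view or model; a perturbation that only fools one surrogate on one crop contributes little to the accumulated gradient.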


Key Contributions

  • Identifies and characterizes the surrogate dependency phenomenon in gradient-based VLM jailbreak attacks under heterogeneous surrogate-target settings
  • Proposes Mosaic framework with three core components: Text-Side Transformation, Multi-View Image Optimization, and Surrogate Ensemble Guidance
  • Achieves state-of-the-art attack success rates against commercial closed-source VLMs like GPT-4V and Gemini by reducing over-reliance on single surrogate models
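The Text-Side Transformation component above perturbs refusal-sensitive lexical patterns in the prompt. A minimal sketch of that idea, assuming a simple substitution table (the trigger words and replacements here are hypothetical, not taken from the paper):

```python
# Hypothetical refusal-sensitive tokens and obfuscated substitutes;
# a real implementation would derive these from the surrogates' refusal behavior.
REFUSAL_SENSITIVE = {"hack": "h@ck", "weapon": "w3apon", "exploit": "expl0it"}

def text_side_transform(prompt: str) -> str:
    """Replace refusal-sensitive words with perturbed variants,
    leaving the rest of the prompt unchanged."""
    return " ".join(REFUSAL_SENSITIVE.get(w.lower(), w) for w in prompt.split())
```

The point of the transformation is to weaken lexical refusal triggers on the text side so the image-side perturbation carries less of the burden alone.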

🛡️ Threat Analysis

Input Manipulation Attack

The paper proposes gradient-based adversarial perturbations on images (Multi-View Image Optimization) that cause VLMs to produce harmful outputs at inference time. This is a clear adversarial example attack using gradient optimization to craft subtle visual perturbations.
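The "subtle perturbation" property of such gradient-based attacks comes from constraining the adversarial image to stay within a small L_inf ball around the original. A minimal sketch of that projection step (the epsilon value 8/255 is a common convention in the adversarial-examples literature, not a figure from this paper):

```python
def project_linf(perturbed, original, eps=8 / 255):
    """Clip a perturbed image back into the L_inf ball of radius eps
    around the original, pixel by pixel. This is the constraint that
    keeps gradient-optimized perturbations visually imperceptible."""
    return [max(o - eps, min(o + eps, p))
            for o, p in zip(original, perturbed)]
```

After every optimization step, projecting back into this ball guarantees no pixel drifts more than eps from its original value, regardless of how large the gradient update was.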


Details

Domains
multimodal, vision, nlp
Model Types
vlm, multimodal, transformer
Threat Tags
white_box, inference_time, untargeted, digital
Datasets
safety benchmarks
Applications
vision-language models, multimodal AI systems