
Robustness of Vision Language Models Against Split-Image Harmful Input Attacks

Md Rafi Ur Rashid 1, MD Sadik Hossain Shanto 2, Vishnu Asutosh Dasu 1, Shagufta Mehnaz 1

0 citations · 52 references · arXiv (Cornell University)

Published on arXiv · 2602.08136

Input Manipulation Attack

OWASP ML Top 10 — ML01

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

The strongest SIVA variant achieves up to 60% higher black-box transfer attack success rate than existing baselines across three state-of-the-art VLMs.

SIVA (Adv-KD)

Novel technique introduced


Vision-Language Models (VLMs) are now a core part of modern AI. Recent work has proposed several visual jailbreak attacks using single, holistic images. However, contemporary VLMs demonstrate strong robustness against such attacks due to extensive safety alignment through preference optimization (e.g., RLHF). In this work, we identify a new vulnerability: while VLM pretraining and instruction tuning generalize well to split-image inputs, safety alignment is typically performed only on holistic images and does not account for harmful semantics distributed across multiple image fragments. Consequently, VLMs often fail to detect and refuse harmful split-image inputs, where unsafe cues emerge only after the images are combined. We introduce novel split-image visual jailbreak attacks (SIVA) that exploit this misalignment. Unlike prior optimization-based attacks, which exhibit poor black-box transferability due to architectural and prior mismatches across models, our attacks evolve in progressive phases from naive splitting to an adaptive white-box attack, culminating in a black-box transfer attack. Our strongest strategy leverages a novel adversarial knowledge distillation (Adv-KD) algorithm to substantially improve cross-model transferability. Evaluations on three state-of-the-art modern VLMs and three jailbreak datasets demonstrate that our strongest attack achieves up to 60% higher transfer success than existing baselines. Lastly, we propose efficient ways to address this critical vulnerability in current VLM safety alignment.
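The naive splitting phase described above can be sketched as follows. This is an illustrative toy, not the authors' code: the helper names (`split_image`, `reassemble`) and the grid-splitting strategy are assumptions; the point is only that each fragment carries a piece of the image, and the full visual content exists only after recombination.

```python
# Minimal sketch of the "naive splitting" idea: divide one image into a grid
# of fragments so that no single fragment shows the full visual content.
# split_image / reassemble are hypothetical names, not from the paper.
import numpy as np

def split_image(img: np.ndarray, rows: int, cols: int) -> list:
    """Split an H x W x C image into rows*cols tiles, row-major order."""
    h, w = img.shape[0] // rows, img.shape[1] // cols
    return [img[r * h:(r + 1) * h, c * w:(c + 1) * w]
            for r in range(rows) for c in range(cols)]

def reassemble(tiles: list, rows: int, cols: int) -> np.ndarray:
    """Inverse of split_image: stitch the tiles back into one image."""
    return np.vstack([np.hstack(tiles[r * cols:(r + 1) * cols])
                      for r in range(rows)])

img = np.arange(4 * 6 * 3).reshape(4, 6, 3)          # toy 4x6 "RGB" image
tiles = split_image(img, rows=2, cols=2)             # four 2x3 fragments
assert np.array_equal(reassemble(tiles, 2, 2), img)  # content only recombines
```

Each fragment is a benign-looking crop on its own; the alignment gap the paper identifies arises because safety training sees only the holistic image, while the model's understanding generalizes to the recombined fragments.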


Key Contributions

  • Identifies a novel safety alignment gap in VLMs: alignment is performed on holistic images but models generalize to split-image inputs, enabling harmful semantics to bypass safety filters when distributed across multiple fragments.
  • Introduces SIVA — a progressive attack framework evolving from naive image splitting through adaptive white-box optimization to a highly transferable black-box attack.
  • Proposes Adv-KD (Adversarial Knowledge Distillation), a novel algorithm that improves cross-model black-box transfer success by up to 60% over existing baselines on three state-of-the-art VLMs.
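The paper does not spell out the Adv-KD algorithm here, but knowledge distillation for transfer attacks generally means training a white-box surrogate (student) to match a target model's soft outputs, then crafting perturbations on the surrogate. The sketch below shows only the standard temperature-softened distillation loss that such a pipeline would minimize; all names are illustrative assumptions, not the authors' API.

```python
# Hypothetical sketch of the generic distillation objective behind
# surrogate-based transfer attacks (NOT the paper's Adv-KD specifics):
# a student model is trained so its logits match the target's, making
# adversarial examples found on the student more likely to transfer.
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over the last axis."""
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def distill_loss(student_logits: np.ndarray,
                 teacher_logits: np.ndarray,
                 temperature: float = 2.0) -> float:
    """KL(teacher || student) on temperature-softened output distributions."""
    p = softmax(teacher_logits / temperature)  # soft teacher targets
    q = softmax(student_logits / temperature)  # student predictions
    return float(np.sum(p * (np.log(p) - np.log(q))))

teacher = np.array([[2.0, 0.5, -1.0]])
assert distill_loss(teacher, teacher) < 1e-9   # perfect match -> zero loss
assert distill_loss(-teacher, teacher) > 0.0   # mismatch -> positive loss
```

Minimizing this loss over a transfer set drives the surrogate toward the target's decision surface; the adversarial optimization is then run entirely in the white-box surrogate, which is what enables the black-box transfer phase.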

🛡️ Threat Analysis

Input Manipulation Attack

The attack involves adversarial manipulation of visual inputs to VLMs, including gradient-based white-box optimization of split images and adversarial knowledge distillation for transferable perturbations. This directly fits the 'adversarial visual inputs to VLMs' criterion for dual tagging under both OWASP ML01 and LLM01.


Details

Domains
vision, multimodal, nlp
Model Types
vlm, multimodal
Threat Tags
white_box, black_box, inference_time, targeted, digital
Datasets
AdvBench, HarmBench, MM-SafetyBench
Applications
vision-language models, multimodal chatbots, ai assistants