
Toward Universal and Transferable Jailbreak Attacks on Vision-Language Models

Kaiyuan Cui 1, Yige Li 2, Yutao Wu 3, Xingjun Ma 4, Sarah Erfani 1, Christopher Leckie 1, Hanxun Huang 1

0 citations · 62 references · arXiv (Cornell University)


Published on arXiv · 2602.01025

Input Manipulation Attack

OWASP ML Top 10 — ML01

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

UltraBreak substantially surpasses prior gradient-based jailbreak methods in black-box average attack success rate on unseen targets, achieving strong transferability across models and universality across attack objectives from a single surrogate.

UltraBreak

Novel technique introduced


Vision-language models (VLMs) extend large language models (LLMs) with vision encoders, enabling text generation conditioned on both images and text. However, this multimodal integration expands the attack surface by exposing the model to image-based jailbreaks crafted to induce harmful responses. Existing gradient-based jailbreak methods transfer poorly, as adversarial patterns overfit to a single white-box surrogate and fail to generalise to black-box models. In this work, we propose Universal and Transferable Jailbreak (UltraBreak), a framework that constrains adversarial patterns through transformations and regularisation in the vision space, while relaxing textual targets through semantic-based objectives. By defining its loss in the textual embedding space of the target LLM, UltraBreak discovers universal adversarial patterns that generalise across diverse jailbreak objectives. This combination of vision-level regularisation and semantically guided textual supervision mitigates surrogate overfitting and enables strong transferability across both models and attack targets. Extensive experiments show that UltraBreak consistently outperforms prior jailbreak methods. Further analysis reveals why earlier approaches fail to transfer, highlighting that smoothing the loss landscape via semantic objectives is crucial for enabling universal and transferable jailbreaks. The code is publicly available at https://github.com/kaiyuanCui/UltraBreak.


Key Contributions

  • UltraBreak: first gradient-based VLM jailbreak achieving simultaneous cross-target universality and cross-model transferability using only a single surrogate model.
  • Vision-level regularization via randomized transformations and total variation (TV) loss to suppress surrogate overfitting and induce robust, recognizable adversarial patterns.
  • Semantically weighted textual objectives defined in the LLM embedding space that relax hard cross-entropy targets, smoothing the loss landscape and enabling generalization across diverse harmful objectives.
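The vision-level regularization described in the contributions can be illustrated with a minimal sketch. The function names, the crop-and-pad transformation, and all parameters below are our illustrative assumptions, not the paper's actual implementation; they only show the general shape of a TV regularizer plus randomized input transformations:

```python
import numpy as np

def total_variation(img):
    # TV regularizer: sum of absolute differences between neighbouring pixels,
    # penalizing high-frequency noise so the adversarial pattern stays smooth.
    # (Simplified; the paper's exact TV formulation may differ.)
    dh = np.abs(img[:, 1:, :] - img[:, :-1, :]).sum()
    dw = np.abs(img[:, :, 1:] - img[:, :, :-1]).sum()
    return float(dh + dw)

def random_transform(img, rng):
    # Expectation-over-transformation style augmentation: random pad-and-crop
    # applied before each gradient step, so the pattern cannot overfit to a
    # fixed pixel grid. Pad size and crop offsets are illustrative choices.
    c, h, w = img.shape
    pad = int(rng.integers(0, 5))
    out = np.pad(img, ((0, 0), (pad, pad), (pad, pad)), mode="edge")
    y = int(rng.integers(0, 2 * pad + 1)) if pad > 0 else 0
    x = int(rng.integers(0, 2 * pad + 1)) if pad > 0 else 0
    return out[:, y:y + h, x:x + w]
```

In a full attack loop, each step would apply `random_transform` to the current adversarial image, compute the jailbreak loss on the transformed input, and add a weighted `total_variation` term before taking the gradient step.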

🛡️ Threat Analysis

Input Manipulation Attack

UltraBreak crafts adversarial image perturbations using gradient-based optimization (with randomized transformations and TV regularization) to manipulate VLM outputs at inference time — a direct input manipulation attack in the visual modality.
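The semantic objective mentioned above relaxes hard token-level cross-entropy targets into a similarity measure in the LLM's embedding space. A minimal sketch of one such loss (cosine distance; the function name, normalization, and exact form are our assumptions, not the paper's formulation):

```python
import numpy as np

def semantic_loss(pred_emb, target_emb):
    # Embedding-space objective: instead of forcing exact target tokens,
    # minimize the cosine distance between the model's output embedding and
    # the target-text embedding. Lower loss = semantically closer output.
    p = pred_emb / np.linalg.norm(pred_emb)
    t = target_emb / np.linalg.norm(target_emb)
    return 1.0 - float(p @ t)
```

Because many token sequences map to nearby embeddings, this soft target smooths the loss landscape relative to exact-match cross-entropy, which the paper identifies as key to cross-model and cross-target generalization.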


Details

Domains
vision · nlp · multimodal
Model Types
vlm · llm · transformer
Threat Tags
white_box · black_box · inference_time · targeted · digital
Datasets
AdvBench
Applications
vision-language models · multimodal chatbots · safety-aligned llm systems