
Toward Universal and Transferable Jailbreak Attacks on Vision-Language Models

Kaiyuan Cui 1, Yige Li 2, Yutao Wu 3, Xingjun Ma 4, Sarah Erfani 1, Christopher Leckie 1, Hanxun Huang 1

0 citations · 62 references · arXiv (Cornell University)


Published on arXiv · 2602.01025

Input Manipulation Attack

OWASP ML Top 10 — ML01

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

UltraBreak substantially surpasses prior gradient-based jailbreak methods in black-box average attack success rate on unseen targets, achieving strong transferability across models and universality across attack objectives from a single surrogate.

UltraBreak

Novel technique introduced


Vision-language models (VLMs) extend large language models (LLMs) with vision encoders, enabling text generation conditioned on both images and text. However, this multimodal integration expands the attack surface by exposing the model to image-based jailbreaks crafted to induce harmful responses. Existing gradient-based jailbreak methods transfer poorly, as adversarial patterns overfit to a single white-box surrogate and fail to generalise to black-box models. In this work, we propose Universal and Transferable Jailbreak (UltraBreak), a framework that constrains adversarial patterns through transformations and regularisation in the vision space, while relaxing textual targets through semantic-based objectives. By defining its loss in the textual embedding space of the target LLM, UltraBreak discovers universal adversarial patterns that generalise across diverse jailbreak objectives. This combination of vision-level regularisation and semantically guided textual supervision mitigates surrogate overfitting and enables strong transferability across both models and attack targets. Extensive experiments show that UltraBreak consistently outperforms prior jailbreak methods. Further analysis reveals why earlier approaches fail to transfer, highlighting that smoothing the loss landscape via semantic objectives is crucial for enabling universal and transferable jailbreaks. The code is publicly available at https://github.com/kaiyuanCui/UltraBreak.


Key Contributions

  • UltraBreak: first gradient-based VLM jailbreak achieving simultaneous cross-target universality and cross-model transferability using only a single surrogate model.
  • Vision-level regularization via randomized transformations and total variation (TV) loss to suppress surrogate overfitting and induce robust, recognizable adversarial patterns.
  • Semantically weighted textual objectives defined in the LLM embedding space that relax hard cross-entropy targets, smoothing the loss landscape and enabling generalization across diverse harmful objectives.
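The vision-level regularization described in the contributions can be illustrated with a minimal sketch. The function names, the crop-and-pad transformation, and all parameters below are our illustrative assumptions, not the paper's actual implementation; they only show the general shape of a TV regularizer plus randomized input transformations:

```python
import numpy as np

def total_variation(img):
    # TV regularizer: sum of absolute differences between neighbouring pixels,
    # penalizing high-frequency noise so the adversarial pattern stays smooth.
    # (Simplified; the paper's exact TV formulation may differ.)
    dh = np.abs(img[:, 1:, :] - img[:, :-1, :]).sum()
    dw = np.abs(img[:, :, 1:] - img[:, :, :-1]).sum()
    return float(dh + dw)

def random_transform(img, rng):
    # Expectation-over-transformation style augmentation: random pad-and-crop
    # applied before each gradient step, so the pattern cannot overfit to a
    # fixed pixel grid. Pad size and crop offsets are illustrative choices.
    c, h, w = img.shape
    pad = int(rng.integers(0, 5))
    out = np.pad(img, ((0, 0), (pad, pad), (pad, pad)), mode="edge")
    y = int(rng.integers(0, 2 * pad + 1)) if pad > 0 else 0
    x = int(rng.integers(0, 2 * pad + 1)) if pad > 0 else 0
    return out[:, y:y + h, x:x + w]
```

In a full attack loop, each step would apply `random_transform` to the current adversarial image, compute the jailbreak loss on the transformed input, and add a weighted `total_variation` term before taking the gradient step.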

🛡️ Threat Analysis

Input Manipulation Attack

UltraBreak crafts adversarial image perturbations using gradient-based optimization (with randomized transformations and TV regularization) to manipulate VLM outputs at inference time — a direct input manipulation attack in the visual modality.
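The semantic objective mentioned above relaxes hard token-level cross-entropy targets into a similarity measure in the LLM's embedding space. A minimal sketch of one such loss (cosine distance; the function name, normalization, and exact form are our assumptions, not the paper's formulation):

```python
import numpy as np

def semantic_loss(pred_emb, target_emb):
    # Embedding-space objective: instead of forcing exact target tokens,
    # minimize the cosine distance between the model's output embedding and
    # the target-text embedding. Lower loss = semantically closer output.
    p = pred_emb / np.linalg.norm(pred_emb)
    t = target_emb / np.linalg.norm(target_emb)
    return 1.0 - float(p @ t)
```

Because many token sequences map to nearby embeddings, this soft target smooths the loss landscape relative to exact-match cross-entropy, which the paper identifies as key to cross-model and cross-target generalization.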


Details

Domains
vision · nlp · multimodal
Model Types
vlm · llm · transformer
Threat Tags
white_box · black_box · inference_time · targeted · digital
Datasets
AdvBench
Applications
vision-language models · multimodal chatbots · safety-aligned llm systems