
Beyond Visual Safety: Jailbreaking Multimodal Large Language Models for Harmful Image Generation via Semantic-Agnostic Inputs

Mingyu Yu, Lana Liu, Zhehao Zhao, Wei Wang, Sujuan Qin


Published on arXiv · 2601.15698

Input Manipulation Attack (OWASP ML Top 10 — ML01)

Prompt Injection (OWASP LLM Top 10 — LLM01)

Key Finding

BVS achieves a 98.21% jailbreak success rate against GPT-5, revealing critical vulnerabilities in current MLLM visual safety alignment mechanisms.

BVS (Beyond Visual Safety)

Novel technique introduced


The rapid advancement of Multimodal Large Language Models (MLLMs) has introduced complex security challenges, particularly at the intersection of textual and visual safety. While existing work has explored the security vulnerabilities of MLLMs, their visual safety boundaries remain insufficiently investigated. In this paper, we propose Beyond Visual Safety (BVS), a novel image-text pair jailbreaking framework designed to probe the visual safety boundaries of MLLMs. BVS employs a "reconstruction-then-generation" strategy, leveraging neutralized visual splicing and inductive recomposition to decouple malicious intent from the raw inputs, thereby inducing MLLMs to generate harmful images. Experimental results demonstrate that BVS achieves a remarkable jailbreak success rate of 98.21% against GPT-5 (12 January 2026 release). Our findings expose critical vulnerabilities in the visual safety alignment of current MLLMs.


Key Contributions

  • BVS framework using 'reconstruction-then-generation' strategy with neutralized visual splicing and inductive recomposition to decouple malicious intent from raw inputs
  • First jailbreak framework targeting visual safety boundaries of MLLMs specifically for harmful image generation (as opposed to harmful text generation)
  • Benchmark dataset for evaluating visual safety of MLLMs, demonstrating 98.21% jailbreak success rate against GPT-5
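The headline number above is a jailbreak attack success rate (ASR), i.e. the fraction of benchmark attempts judged to produce harmful output. A minimal sketch of that computation, assuming a hypothetical list of per-attempt judge verdicts (the paper does not publish its evaluation code, and the benchmark size used here is illustrative only):

```python
# Illustrative sketch (not the paper's code): jailbreak attack success
# rate (ASR) over a benchmark, given one boolean judge verdict per attempt.

def attack_success_rate(verdicts):
    """Return the fraction of attempts judged as successful jailbreaks."""
    if not verdicts:
        return 0.0
    return sum(1 for v in verdicts if v) / len(verdicts)

# Purely arithmetical example: 55 successes out of 56 attempts
# rounds to the reported 98.21%.
verdicts = [True] * 55 + [False]
print(f"{attack_success_rate(verdicts):.2%}")  # 98.21%
```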

🛡️ Threat Analysis

Input Manipulation Attack

BVS crafts adversarial visual inputs (neutralized visual splicing — compositing images with large semantic distances) specifically designed to bypass MLLM safety mechanisms, constituting input manipulation at the visual modality level.
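The splicing step can be pictured as concatenating individually benign image fragments so that no single fragment triggers the safety filter. A dependency-free sketch under that assumption (the paper does not specify its compositing code; images are modeled as plain 2D pixel grids, and the fragment contents are placeholders):

```python
# Illustrative sketch (assumption, not the paper's implementation):
# "neutralized visual splicing" composites semantically distant image
# regions side by side, so the harmful intent is not present in any
# single region. Images here are 2D pixel grids (lists of rows).

def splice_horizontally(left, right):
    """Concatenate two equal-height pixel grids side by side."""
    if len(left) != len(right):
        raise ValueError("images must have the same height")
    return [lrow + rrow for lrow, rrow in zip(left, right)]

# Two tiny 2x2 "images" stand in for the neutralized fragments.
neutral_a = [[0, 0], [0, 0]]
neutral_b = [[255, 255], [255, 255]]
spliced = splice_horizontally(neutral_a, neutral_b)
print(spliced)  # [[0, 0, 255, 255], [0, 0, 255, 255]]
```

In the attack as described, the spliced image is then paired with an inductive text prompt that asks the model to "recompose" the fragments, which is what steers generation toward the harmful target.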


Details

Domains
vision, nlp, multimodal
Model Types
vlm, llm, multimodal
Threat Tags
black_box, inference_time, targeted
Datasets
Custom MLLM visual safety benchmark (introduced in paper)
Applications
multimodal LLM safety alignment, harmful image generation prevention