When the Prompt Becomes Visual: Vision-Centric Jailbreak Attacks for Large Image Editing Models
Jiacheng Hou 1, Yining Sun 1,2, Ruochong Jin 1,2, Haochen Han 2, Fangming Liu 2, Wai Kin Victor Chan 1, Alex Jinpeng Wang 3
Published on arXiv
2602.10179
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
VJA achieves attack success rates of up to 80.9% on Nano Banana Pro and 70.1% on GPT-Image-1.5, demonstrating that commercial image editing models are broadly vulnerable to visual-only jailbreaks
VJA (Vision-Centric Jailbreak Attack)
Novel technique introduced
Recent advances in large image editing models have shifted the paradigm from text-driven instructions to vision-prompt editing, where user intent is inferred directly from visual inputs such as marks, arrows, and visual-text prompts. While this paradigm greatly expands usability, it also introduces a critical and underexplored safety risk: the attack surface itself becomes visual. In this work, we propose the Vision-Centric Jailbreak Attack (VJA), the first visual-to-visual jailbreak attack that conveys malicious instructions purely through visual inputs. To systematically study this emerging threat, we introduce IESBench, a safety-oriented benchmark for image editing models. Extensive experiments on IESBench demonstrate that VJA effectively compromises state-of-the-art commercial models, achieving attack success rates of up to 80.9% on Nano Banana Pro and 70.1% on GPT-Image-1.5. To mitigate this vulnerability, we propose a training-free defense based on introspective multimodal reasoning, which substantially improves the safety of poorly aligned models to a level comparable with commercial systems, without auxiliary guard models and with negligible computational overhead. Our findings expose new vulnerabilities and provide both a benchmark and a practical defense to advance safe and trustworthy modern image editing systems. Warning: This paper contains offensive images created by large image editing models.
Key Contributions
- VJA: first visual-to-visual jailbreak attack that encodes malicious instructions purely through semantic visual elements (marks, arrows, visual-text prompts) directed at image editing VLMs
- IESBench: safety-oriented benchmark for systematically evaluating image editing models against visual jailbreak threats
- Training-free defense based on introspective multimodal reasoning that raises poorly aligned models to commercial-grade safety levels with negligible overhead and no auxiliary guard models
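The attack and defense above can be illustrated conceptually. The sketch below is hypothetical and is not the paper's implementation: it represents a vision-centric prompt as a bundle of overlay elements (marks, arrows, rendered text) with no textual instruction channel, and models the introspective defense as asking the model to verbalize what the visual elements request before a plain safety check is applied to that self-description. All names (`VisualElement`, `compose_vision_prompt`, `introspective_gate`, the `describe` callback, and the banned-term list) are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class VisualElement:
    # One overlay drawn onto the input image: a mark, an arrow,
    # or rendered text (a "visual-text prompt"). Illustrative only.
    kind: str                 # "mark" | "arrow" | "text"
    payload: str              # region label or rendered instruction
    xy: Tuple[int, int]       # anchor position in pixel coordinates

def compose_vision_prompt(elements: List[VisualElement]) -> dict:
    """Bundle overlay elements into an edit request that carries no
    textual instruction channel -- the intent lives in the pixels."""
    return {
        "text_instruction": None,
        "overlays": [(e.kind, e.payload, e.xy) for e in elements],
    }

def introspective_gate(request: dict,
                       describe: Callable[[list], str]) -> bool:
    """Training-free defense sketch: before editing, have the model
    itself describe what the visual elements ask for, then run an
    ordinary safety check on that self-description. `describe` stands
    in for a multimodal model's introspection step."""
    description = describe(request["overlays"]).lower()
    banned = ("remove the watermark", "forge", "undress")
    return not any(term in description for term in banned)
```

A benign request whose self-description is "add a red arrow near the cat" would pass the gate, while one the model describes as "remove the watermark from the logo" would be refused; the point of the design is that the safety check runs on the model's own verbalization of the visual intent rather than on a (nonexistent) text prompt.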