When the Prompt Becomes Visual: Vision-Centric Jailbreak Attacks for Large Image Editing Models
Jiacheng Hou 1, Yining Sun 1,2, Ruochong Jin 1,2, Haochen Han 2, Fangming Liu 2, Wai Kin Victor Chan 1, Alex Jinpeng Wang 3
Published on arXiv
2602.10179
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
VJA achieves attack success rates of up to 80.9% on Nano Banana Pro and 70.1% on GPT-Image-1.5, demonstrating that commercial image editing models are broadly vulnerable to visual-only jailbreaks
VJA (Vision-Centric Jailbreak Attack)
Novel technique introduced
Recent advances in large image editing models have shifted the paradigm from text-driven instructions to vision-prompt editing, where user intent is inferred directly from visual inputs such as marks, arrows, and visual-text prompts. While this paradigm greatly expands usability, it also introduces a critical and underexplored safety risk: the attack surface itself becomes visual. In this work, we propose the Vision-Centric Jailbreak Attack (VJA), the first visual-to-visual jailbreak attack that conveys malicious instructions purely through visual inputs. To systematically study this emerging threat, we introduce IESBench, a safety-oriented benchmark for image editing models. Extensive experiments on IESBench demonstrate that VJA effectively compromises state-of-the-art commercial models, achieving attack success rates of up to 80.9% on Nano Banana Pro and 70.1% on GPT-Image-1.5. To mitigate this vulnerability, we propose a training-free defense based on introspective multimodal reasoning, which substantially improves the safety of poorly aligned models to a level comparable with commercial systems, without auxiliary guard models and with negligible computational overhead. Our findings expose new vulnerabilities and provide both a benchmark and a practical defense to advance safe and trustworthy modern image editing systems. Warning: This paper contains offensive images created by large image editing models.
Key Contributions
- VJA: first visual-to-visual jailbreak attack that encodes malicious instructions purely through semantic visual elements (marks, arrows, visual-text prompts) directed at image editing VLMs
- IESBench: safety-oriented benchmark for systematically evaluating image editing models against visual jailbreak threats
- Training-free defense based on introspective multimodal reasoning that raises poorly aligned models to commercial-grade safety levels with negligible overhead and no auxiliary guard models
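The attack and defense above can be illustrated conceptually. The sketch below is hypothetical and is not the paper's implementation: it represents a vision-centric prompt as a bundle of overlay elements (marks, arrows, rendered text) with no textual instruction channel, and models the introspective defense as asking the model to verbalize what the visual elements request before a plain safety check is applied to that self-description. All names (`VisualElement`, `compose_vision_prompt`, `introspective_gate`, the `describe` callback, and the banned-term list) are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class VisualElement:
    # One overlay drawn onto the input image: a mark, an arrow,
    # or rendered text (a "visual-text prompt"). Illustrative only.
    kind: str                 # "mark" | "arrow" | "text"
    payload: str              # region label or rendered instruction
    xy: Tuple[int, int]       # anchor position in pixel coordinates

def compose_vision_prompt(elements: List[VisualElement]) -> dict:
    """Bundle overlay elements into an edit request that carries no
    textual instruction channel -- the intent lives in the pixels."""
    return {
        "text_instruction": None,
        "overlays": [(e.kind, e.payload, e.xy) for e in elements],
    }

def introspective_gate(request: dict,
                       describe: Callable[[list], str]) -> bool:
    """Training-free defense sketch: before editing, have the model
    itself describe what the visual elements ask for, then run an
    ordinary safety check on that self-description. `describe` stands
    in for a multimodal model's introspection step."""
    description = describe(request["overlays"]).lower()
    banned = ("remove the watermark", "forge", "undress")
    return not any(term in description for term in banned)
```

A benign request whose self-description is "add a red arrow near the cat" would pass the gate, while one the model describes as "remove the watermark from the logo" would be refused; the point of the design is that the safety check runs on the model's own verbalization of the visual intent rather than on a (nonexistent) text prompt.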