Text is All You Need for Vision-Language Model Jailbreaking
Yihang Chen, Zhao Xu, Youyuan Jiang, Tianle Zheng, Cho-Jui Hsieh
Published on arXiv
2602.00420
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
Text-DJ successfully bypasses safety alignment on state-of-the-art closed-source and open-source LVLMs, including GPT-4.1 mini, Gemini, and Qwen3-VL, using only text-as-image inputs
Text-DJ (Text Distraction Jailbreaking)
Novel technique introduced
Large Vision-Language Models (LVLMs) are increasingly equipped with robust safeguards to prevent responses to harmful or disallowed prompts. However, these defenses often focus on analyzing explicit textual inputs or relevant visual scenes. In this work, we introduce Text-DJ, a novel jailbreak attack that bypasses these safeguards by exploiting the model's Optical Character Recognition (OCR) capability. Our methodology consists of three stages. First, we decompose a single harmful query into multiple semantically related but individually more benign sub-queries. Second, we pick a set of distraction queries that are maximally irrelevant to the harmful query. Third, we present all decomposed sub-queries and distraction queries to the LVLM simultaneously as a grid of images, with the sub-queries positioned in the middle of the grid. We demonstrate that this method successfully circumvents the safety alignment of state-of-the-art LVLMs. We argue this attack succeeds by (1) converting text-based prompts into images, bypassing standard text-based filters, and (2) inducing distractions, so that the model's safety protocols fail to link the sub-queries scattered among a large number of irrelevant queries. Overall, our findings expose a critical vulnerability: LVLMs' OCR capabilities are not robust to dispersed, multi-image adversarial inputs, highlighting the need for defenses against fragmented multimodal inputs.
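The third stage's "sub-queries in the middle" layout can be sketched as a simple cell-assignment routine. This is a minimal illustration with benign placeholder queries; the function name, the Manhattan-distance heuristic, and the grid parameters are our own assumptions, not the authors' implementation:

```python
import math

def layout_query_grid(sub_queries, distractors, cols=3):
    """Assign sub-queries to the central grid cells and distractors to the
    periphery, mimicking Text-DJ's central placement of sub-queries.
    Returns a {(row, col): text} mapping (layout heuristic assumed)."""
    texts = list(sub_queries) + list(distractors)
    rows = math.ceil(len(texts) / cols)
    center = ((rows - 1) / 2, (cols - 1) / 2)
    # Sort cells by Manhattan distance from the grid centre, nearest first,
    # so the first cells in the ordering are the most central ones.
    cells = sorted(
        ((r, c) for r in range(rows) for c in range(cols)),
        key=lambda rc: abs(rc[0] - center[0]) + abs(rc[1] - center[1]),
    )
    # Central cells receive the sub-queries; remaining cells get distractors.
    return dict(zip(cells, texts))

# Benign placeholder queries for illustration only.
subs = ["sub-query A", "sub-query B"]
dists = ["What is the capital of France?", "Name a prime number.",
         "Describe the water cycle.", "List three fruits.",
         "What color is the sky?", "Count to five.", "Define gravity."]
grid = layout_query_grid(subs, dists)  # 3x3 grid; sub-queries at the centre
```

Each cell's text would then be rendered as an image tile (e.g. with Pillow) and the tiles concatenated into the single grid image presented to the LVLM.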
Key Contributions
- Text-DJ: a black-box, model-agnostic jailbreak that decomposes harmful queries into benign sub-queries and embeds them in a distractor image grid, exploiting LVLM OCR to bypass text-based safety filters
- Demonstrates that safety alignment fails when harmful prompts are scattered across multi-image OCR inputs alongside irrelevant distraction queries
- Presents ablation studies showing that semantic distraction is more effective than visual distraction, pinpointing OCR safety as a fundamental, unaddressed vulnerability