Why does weak-OOD help? A Further Step Towards Understanding Jailbreaking VLMs
Yuxuan Zhou 1, Yuzhao Peng 1, Yang Bai 2, Kuofeng Gao 1, Yihao Zhang 3, Yechao Zhang 4, Xun Chen 2, Tao Yu 5, Tao Dai 6, Shu-Tao Xia 1
Published on arXiv: 2511.08367
Input Manipulation Attack
OWASP ML Top 10 — ML01
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
JOCR renders harmful content as readable text inside images, exploiting the OCR capability instilled during VLM pre-training. It outperforms SOTA OOD-based jailbreak baselines while requiring minimal additional computation.
JOCR
Novel technique introduced
Large Vision-Language Models (VLMs) are susceptible to jailbreak attacks: researchers have developed a variety of attack strategies that successfully bypass their safety mechanisms. Among these, jailbreak methods based on the Out-of-Distribution (OOD) strategy have attracted widespread attention for their simplicity and effectiveness. This paper advances the in-depth understanding of OOD-based VLM jailbreaks. Experimental results demonstrate that jailbreak samples generated via mild OOD strategies are more effective at circumventing the safety constraints of VLMs, a phenomenon we define as "weak-OOD". To uncover its underlying causes, we take SI-Attack, a typical OOD-based jailbreak method, as a case study. We attribute the phenomenon to a trade-off between two dominant factors: input intent perception and model refusal triggering. These two factors respond inconsistently to OOD manipulations, and this inconsistency gives rise to weak-OOD. Furthermore, we provide a theoretical argument for the inevitability of such inconsistency, rooted in discrepancies between model pre-training and alignment processes. Building on these insights, we draw inspiration from optical character recognition (OCR) capability enhancement, a core task in the pre-training phase of mainstream VLMs. Leveraging this capability, we design a simple yet highly effective VLM jailbreak method whose performance surpasses that of SOTA baselines.
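The kind of OOD manipulation studied here can be illustrated with image patch shuffling in the spirit of SI-Attack. A minimal NumPy sketch (the function name and parameters are illustrative, not from the paper): the patch size controls the strength of the OOD shift, with larger patches corresponding to milder, "weak-OOD" perturbations.

```python
import numpy as np

def shuffle_patches(image: np.ndarray, patch: int,
                    rng: np.random.Generator) -> np.ndarray:
    """Split an HxWxC image into patch-by-patch tiles and permute them.

    A larger `patch` means fewer, coarser tiles are rearranged,
    i.e. a milder OOD shift of the visual input.
    """
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "patch must tile the image"
    # Cut the image into a flat list of tiles.
    tiles = (image.reshape(h // patch, patch, w // patch, patch, c)
                  .transpose(0, 2, 1, 3, 4)
                  .reshape(-1, patch, patch, c))
    # Randomly permute the tiles, then stitch them back together.
    perm = rng.permutation(len(tiles))
    shuffled = tiles[perm].reshape(h // patch, w // patch, patch, patch, c)
    return shuffled.transpose(0, 2, 1, 3, 4).reshape(h, w, c)

rng = np.random.default_rng(0)
img = np.arange(4 * 4 * 3, dtype=np.uint8).reshape(4, 4, 3)
mild = shuffle_patches(img, patch=2, rng=rng)    # 4 coarse tiles: mild shift
strong = shuffle_patches(img, patch=1, rng=rng)  # 16 pixel tiles: strong shift
```

The weak-OOD finding is that the milder variant (larger patches) tends to jailbreak more effectively: it degrades the signals that trigger refusal while leaving intent perception largely intact.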
Key Contributions
- Identifies and formalizes the 'weak-OOD' phenomenon: mild OOD image manipulation yields better jailbreak performance than stronger OOD shifts, and shows this is not due to simple feature destruction
- Attributes the weak-OOD phenomenon to an asymmetry between 'input intent perception' (robust to OOD) and 'model refusal triggering' (non-robust), rooted in the gap between VLM pre-training and safety alignment generalization
- Proposes JOCR, an OCR-inspired VLM jailbreak method that maintains intent perception while suppressing model refusal, outperforming SOTA baselines at minimal cost
🛡️ Threat Analysis
The paper proposes JOCR, which manipulates visual inputs (embedding harmful text via OCR-style rendering in images) to cause VLMs to bypass safety mechanisms at inference time — a direct adversarial visual input manipulation attack on VLMs causing safety failure.
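The delivery mechanism underlying this threat, rendering a prompt as readable text inside an image so the model perceives it through its OCR capability rather than the text channel, can be sketched with Pillow (a minimal illustration with benign text; the function name is illustrative and not taken from the paper):

```python
from PIL import Image, ImageDraw

def render_text_as_image(text: str, width: int = 512,
                         height: int = 128) -> Image.Image:
    """Draw `text` onto a plain white canvas using Pillow's default font.

    The resulting image carries the instruction in the visual channel,
    which a VLM reads via the OCR ability acquired during pre-training.
    """
    img = Image.new("RGB", (width, height), "white")
    draw = ImageDraw.Draw(img)
    draw.text((10, height // 2), text, fill="black")
    return img

# Benign demonstration: the image, not the raw string, is what would be
# submitted to the VLM alongside an innocuous text prompt.
img = render_text_as_image("Describe the steps to bake bread.")
```

Defenses therefore need to scan the visual channel (e.g. with an OCR pass over incoming images) rather than filtering text inputs alone.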