Why does weak-OOD help? A Further Step Towards Understanding Jailbreaking VLMs
Yuxuan Zhou 1, Yuzhao Peng 1, Yang Bai 2, Kuofeng Gao 1, Yihao Zhang 3, Yechao Zhang 4, Xun Chen 2, Tao Yu 5, Tao Dai 6, Shu-Tao Xia 1
Published on arXiv: 2511.08367
Input Manipulation Attack
OWASP ML Top 10 — ML01
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
JOCR renders harmful content as readable text inside images, exploiting the OCR capability instilled during VLM pre-training. It outperforms SOTA OOD-based jailbreak baselines while requiring minimal additional computation.
JOCR
Novel technique introduced
Large Vision-Language Models (VLMs) are susceptible to jailbreak attacks: researchers have developed a variety of attack strategies that successfully bypass their safety mechanisms. Among these, jailbreak methods based on the Out-of-Distribution (OOD) strategy have attracted widespread attention for their simplicity and effectiveness. This paper advances the in-depth understanding of OOD-based VLM jailbreaks. Experimental results demonstrate that jailbreak samples generated via mild OOD strategies are more effective at circumventing the safety constraints of VLMs, a phenomenon we define as "weak-OOD". To uncover its underlying causes, we take SI-Attack, a typical OOD-based jailbreak method, as a case study. We attribute the phenomenon to a trade-off between two dominant factors: input intent perception and model refusal triggering. These two factors respond inconsistently to OOD manipulations, and this inconsistency gives rise to weak-OOD. Furthermore, we provide a theoretical argument for the inevitability of such inconsistency, rooted in discrepancies between model pre-training and alignment processes. Building on these insights, we draw inspiration from optical character recognition (OCR) capability enhancement, a core task in the pre-training phase of mainstream VLMs. Leveraging this capability, we design a simple yet highly effective VLM jailbreak method whose performance surpasses that of SOTA baselines.
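The kind of OOD manipulation studied here can be illustrated with image patch shuffling in the spirit of SI-Attack. A minimal NumPy sketch (the function name and parameters are illustrative, not from the paper): the patch size controls the strength of the OOD shift, with larger patches corresponding to milder, "weak-OOD" perturbations.

```python
import numpy as np

def shuffle_patches(image: np.ndarray, patch: int,
                    rng: np.random.Generator) -> np.ndarray:
    """Split an HxWxC image into patch-by-patch tiles and permute them.

    A larger `patch` means fewer, coarser tiles are rearranged,
    i.e. a milder OOD shift of the visual input.
    """
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "patch must tile the image"
    # Cut the image into a flat list of tiles.
    tiles = (image.reshape(h // patch, patch, w // patch, patch, c)
                  .transpose(0, 2, 1, 3, 4)
                  .reshape(-1, patch, patch, c))
    # Randomly permute the tiles, then stitch them back together.
    perm = rng.permutation(len(tiles))
    shuffled = tiles[perm].reshape(h // patch, w // patch, patch, patch, c)
    return shuffled.transpose(0, 2, 1, 3, 4).reshape(h, w, c)

rng = np.random.default_rng(0)
img = np.arange(4 * 4 * 3, dtype=np.uint8).reshape(4, 4, 3)
mild = shuffle_patches(img, patch=2, rng=rng)    # 4 coarse tiles: mild shift
strong = shuffle_patches(img, patch=1, rng=rng)  # 16 pixel tiles: strong shift
```

The weak-OOD finding is that the milder variant (larger patches) tends to jailbreak more effectively: it degrades the signals that trigger refusal while leaving intent perception largely intact.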
Key Contributions
- Identifies and formalizes the 'weak-OOD' phenomenon: mild OOD image manipulation yields better jailbreak performance than stronger OOD shifts, and shows this is not due to simple feature destruction
- Attributes the weak-OOD phenomenon to an asymmetry between 'input intent perception' (robust to OOD) and 'model refusal triggering' (non-robust), rooted in the gap between VLM pre-training and safety alignment generalization
- Proposes JOCR, an OCR-inspired VLM jailbreak method that maintains intent perception while suppressing model refusal, outperforming SOTA baselines at minimal cost
🛡️ Threat Analysis
The paper proposes JOCR, which manipulates visual inputs (embedding harmful text via OCR-style rendering in images) to cause VLMs to bypass safety mechanisms at inference time — a direct adversarial visual input manipulation attack on VLMs causing safety failure.
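The delivery mechanism underlying this threat, rendering a prompt as readable text inside an image so the model perceives it through its OCR capability rather than the text channel, can be sketched with Pillow (a minimal illustration with benign text; the function name is illustrative and not taken from the paper):

```python
from PIL import Image, ImageDraw

def render_text_as_image(text: str, width: int = 512,
                         height: int = 128) -> Image.Image:
    """Draw `text` onto a plain white canvas using Pillow's default font.

    The resulting image carries the instruction in the visual channel,
    which a VLM reads via the OCR ability acquired during pre-training.
    """
    img = Image.new("RGB", (width, height), "white")
    draw = ImageDraw.Draw(img)
    draw.text((10, height // 2), text, fill="black")
    return img

# Benign demonstration: the image, not the raw string, is what would be
# submitted to the VLM alongside an innocuous text prompt.
img = render_text_as_image("Describe the steps to bake bread.")
```

Defenses therefore need to scan the visual channel (e.g. with an OCR pass over incoming images) rather than filtering text inputs alone.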