Jailbreaks on Vision Language Model via Multimodal Reasoning
Aarush Noheria 1,2, Yuguang Yao 2
Published on arXiv
2601.22398
Input Manipulation Attack
OWASP ML Top 10 — ML01
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
The dual strategy of CoT prompting and ReAct-driven adaptive image noising significantly improves attack success rates while maintaining naturalness in both the text and visual modalities.
ReAct-CoT Jailbreak
Novel technique introduced
Vision-language models (VLMs) have become central to tasks such as visual question answering, image captioning, and text-to-image generation. However, their outputs are highly sensitive to prompt variations, which can expose weaknesses in safety alignment. In this work, we present a jailbreak framework that exploits post-training Chain-of-Thought (CoT) prompting to construct stealthy prompts capable of bypassing safety filters. To further increase attack success rates (ASR), we propose a ReAct-driven adaptive noising mechanism that iteratively perturbs input images based on model feedback. This approach leverages the ReAct paradigm to refine adversarial noise in the regions most likely to activate safety defenses, thereby enhancing stealth and evasion. Experimental results demonstrate that the proposed dual strategy significantly improves ASR while maintaining naturalness in both the text and visual domains.
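The feedback-driven perturbation loop described in the abstract follows a familiar act-observe-refine pattern. The paper does not publish its implementation, so the following is only a minimal illustrative sketch: `safety_trigger_score` is a toy stand-in for the model feedback the real framework would obtain from the target VLM, and the region selection and acceptance rule are simplified assumptions, not the authors' method.

```python
import numpy as np

def safety_trigger_score(image: np.ndarray) -> float:
    """Hypothetical proxy for model feedback: pretend higher mean
    intensity means the image is more likely to trip safety filters.
    In the paper's setting this signal would come from the target VLM."""
    return float(image.mean())

def adaptive_noising(image: np.ndarray, steps: int = 50,
                     sigma: float = 0.05, rng=None):
    """ReAct-style loop: act (perturb a region), observe (score),
    refine (keep only perturbations that lower the trigger score).
    This is random-search descent under small Gaussian noise, a
    simplified sketch of the adaptive noising idea."""
    rng = rng or np.random.default_rng(0)
    best = image.copy()
    best_score = safety_trigger_score(best)
    h, w = best.shape
    for _ in range(steps):
        # Act: perturb one random 8x8 patch (a crude stand-in for
        # targeting "regions most likely to activate safety defenses").
        candidate = best.copy()
        y, x = rng.integers(0, h - 8), rng.integers(0, w - 8)
        candidate[y:y + 8, x:x + 8] += rng.normal(0.0, sigma, (8, 8))
        candidate = np.clip(candidate, 0.0, 1.0)  # keep a valid image
        # Observe + refine: accept only improving moves.
        score = safety_trigger_score(candidate)
        if score < best_score:
            best, best_score = candidate, score
    return best, best_score

img = np.full((32, 32), 0.5)  # toy grayscale image
noised, score = adaptive_noising(img)
```

The greedy accept-if-improved rule keeps perturbations small and localized, which is what lets this style of attack preserve visual naturalness while still moving the feedback signal.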
Key Contributions
- CoT-based stealthy prompt construction exploiting reasoning capabilities to bypass VLM safety filters
- ReAct-driven adaptive image noising mechanism that iteratively refines adversarial perturbations in regions most likely to activate safety defenses
- Dual-strategy framework combining text-level and visual-level attacks achieving improved attack success rates while preserving naturalness
🛡️ Threat Analysis
Proposes a ReAct-driven adaptive noising mechanism that iteratively perturbs input images based on model feedback to evade VLM safety defenses — a direct adversarial attack on inference-time visual inputs.