
Jailbreaks on Vision Language Model via Multimodal Reasoning

Aarush Noheria 1,2, Yuguang Yao 2

0 citations · 12 references · arXiv

Published on arXiv: 2601.22398

Input Manipulation Attack (OWASP ML Top 10: ML01)

Prompt Injection (OWASP LLM Top 10: LLM01)

Key Finding

The dual strategy of CoT prompting and ReAct-driven adaptive image noising significantly improves the attack success rate while maintaining naturalness in both the text and visual modalities.

ReAct-CoT Jailbreak

Novel technique introduced

Abstract

Vision-language models (VLMs) have become central to tasks such as visual question answering, image captioning, and text-to-image generation. However, their outputs are highly sensitive to prompt variations, which can reveal vulnerabilities in safety alignment. In this work, we present a jailbreak framework that exploits post-training Chain-of-Thought (CoT) prompting to construct stealthy prompts capable of bypassing safety filters. To further increase attack success rates (ASR), we propose a ReAct-driven adaptive noising mechanism that iteratively perturbs input images based on model feedback. This approach leverages the ReAct paradigm to refine adversarial noise in regions most likely to activate safety defenses, thereby enhancing stealth and evasion. Experimental results demonstrate that the proposed dual strategy significantly improves ASR while maintaining naturalness in both the text and visual domains.


Key Contributions

  • CoT-based stealthy prompt construction exploiting reasoning capabilities to bypass VLM safety filters
  • ReAct-driven adaptive image noising mechanism that iteratively refines adversarial perturbations in regions most likely to activate safety defenses
  • Dual-strategy framework combining text-level and visual-level attacks achieving improved attack success rates while preserving naturalness

🛡️ Threat Analysis

Input Manipulation Attack

Proposes a ReAct-driven adaptive noising mechanism that iteratively perturbs input images based on model feedback to evade VLM safety defenses — a direct adversarial visual input attack on inference-time inputs.
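The paper does not publish its implementation here, but the feedback-driven loop it describes can be sketched in outline. In the toy code below, `query_vlm` is a hypothetical stand-in for a black-box VLM API that returns a refusal signal plus a coarse per-region feedback map; the region selection, noise budget (`eps`), and step size are illustrative assumptions, not the authors' actual method. The point is only to show the ReAct-style structure: observe model feedback, perturb the most-implicated regions, and project the result back into a small perturbation ball so the image stays natural-looking.

```python
import numpy as np

def query_vlm(image: np.ndarray, prompt: str):
    """Hypothetical black-box oracle (stand-in for a real VLM API).
    Returns (refused, feedback) where feedback is a coarse 8x8-block map;
    higher values mark regions more implicated in triggering the filter.
    Toy criterion: refuse if any block's mean intensity exceeds 0.9."""
    h, w = image.shape  # assumes h and w are multiples of 8
    blocks = image.reshape(h // 8, 8, w // 8, 8).mean(axis=(1, 3))
    return blocks.max() > 0.9, blocks

def react_adaptive_noising(image, prompt, eps=0.1, step=0.02,
                           max_iters=20, rng=None):
    """ReAct-style loop: reason over model feedback, then act by adding
    noise only to the top-scoring regions, under an L-infinity budget."""
    rng = np.random.default_rng(0) if rng is None else rng
    adv = image.copy()
    for _ in range(max_iters):
        refused, feedback = query_vlm(adv, prompt)
        if not refused:  # simulated bypass achieved; stop perturbing
            break
        # Act: perturb the k regions that most activate the defense.
        k = 4
        flat = feedback.ravel()
        for idx in np.argpartition(flat, -k)[-k:]:
            r, c = divmod(int(idx), feedback.shape[1])
            patch = rng.uniform(-step, step, size=(8, 8))
            adv[r * 8:(r + 1) * 8, c * 8:(c + 1) * 8] += patch
        # Project back into the eps-ball and valid pixel range so the
        # cumulative change stays visually small.
        adv = np.clip(adv, image - eps, image + eps)
        adv = np.clip(adv, 0.0, 1.0)
    return adv
```

A real attack would replace the toy oracle with model responses (e.g., refusal classification of the VLM's text output) and a more principled feedback signal; the projection step is what preserves the "naturalness" constraint the paper emphasizes.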


Details

Domains: vision, nlp, multimodal
Model Types: vlm, llm, multimodal
Threat Tags: black_box, inference_time, targeted, digital
Applications: visual question answering, image captioning, text-to-image generation