Reasoning-Oriented Programming: Chaining Semantic Gadgets to Jailbreak Large Vision Language Models
Quanchen Zou 1, Moyang Chen 2, Zonghao Ying 3, Wenzhuo Xu 1, Yisong Xiao 3, Deyue Zhang 1, Dongdong Yang 1, Zhao Liu 1, Xiangzheng Zhang 1
Published on arXiv
arXiv:2603.09246
Input Manipulation Attack
OWASP ML Top 10 — ML01
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
Outperforms the strongest existing jailbreak baseline by an average of 4.67% ASR on open-source LVLMs and 9.50% on commercial models including GPT-4o and Claude 3.7 Sonnet.
Reasoning-Oriented Programming (ROP)
Novel technique introduced
Large Vision-Language Models (LVLMs) undergo safety alignment to suppress harmful content. However, current defenses predominantly target explicit malicious patterns in the input representation, often overlooking vulnerabilities inherent in compositional reasoning. In this paper, we identify a systemic flaw whereby LVLMs can be induced to synthesize harmful logic from benign premises. We formalize this attack paradigm as *Reasoning-Oriented Programming*, drawing a structural analogy to return-oriented programming in systems security. Just as return-oriented programming circumvents memory protections by chaining benign instruction sequences, our approach exploits the model's instruction-following capability to orchestrate a semantic collision of orthogonal benign inputs. We instantiate this paradigm via \tool{}, an automated framework that optimizes for *semantic orthogonality* and *spatial isolation*. By generating visual gadgets that are semantically decoupled from the harmful intent and arranging them to prevent premature feature fusion, \tool{} forces the malicious logic to emerge only during the late-stage reasoning process, effectively bypassing perception-level alignment. We evaluate \tool{} on SafeBench and MM-SafetyBench across 7 state-of-the-art LVLMs, including GPT-4o and Claude 3.7 Sonnet. Our results demonstrate that \tool{} consistently circumvents safety alignment, outperforming the strongest existing baseline by an average of 4.67% attack success rate on open-source models and 9.50% on commercial models.
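The *semantic orthogonality* property described above (each gadget looks unrelated to the harmful goal on its own) can be illustrated as an embedding-space filter. This is a minimal sketch under stated assumptions, not the paper's implementation: the toy 3-dimensional embeddings stand in for a real encoder such as CLIP, and the `select_orthogonal` helper and threshold `tau` are hypothetical names.

```python
import numpy as np

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def select_orthogonal(gadgets: dict, intent: np.ndarray, tau: float = 0.2) -> list:
    """Keep only gadgets whose embedding is near-orthogonal to the intent.

    A gadget passes if |cos(gadget, intent)| <= tau, i.e. it carries little
    detectable trace of the harmful goal when inspected in isolation.
    """
    return [name for name, emb in gadgets.items()
            if abs(cosine(emb, intent)) <= tau]

# Toy 3-d embeddings standing in for real encoder outputs.
intent = np.array([1.0, 0.0, 0.0])          # embedding of the harmful goal
gadgets = {
    "benign_a": np.array([0.0, 1.0, 0.0]),  # unrelated premise -> kept
    "leaky":    np.array([0.9, 0.1, 0.0]),  # too close to the intent -> filtered
    "benign_b": np.array([0.0, 0.0, 1.0]),  # unrelated premise -> kept
}

print(select_orthogonal(gadgets, intent))   # ['benign_a', 'benign_b']
```

The filter discards the "leaky" gadget because its embedding nearly parallels the intent vector; only gadgets that are individually benign-looking survive to be composed at reasoning time.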
Key Contributions
- Introduces Reasoning-Oriented Programming, a novel adversarial paradigm that exploits LVLM compositional reasoning by distributing harmful intent across semantically orthogonal benign visual gadgets.
- Implements an automated attack framework using semantic gadget mining and gradient-free evolutionary search for control-flow prompt optimization.
- Achieves SOTA jailbreak performance across 7 LVLMs including GPT-4o and Claude 3.7 Sonnet, outperforming baselines by 4.67% on open-source and 9.50% on commercial models.
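The gradient-free evolutionary search mentioned in the contributions can be sketched as a generic black-box loop: keep the fittest half of a prompt population, refill by mutation, repeat. Everything here is illustrative, not the paper's code: the token pool, the keyword-overlap `fitness` (a stand-in for a real black-box judge model), and the `evolve` helper are assumed names.

```python
import random

POOL = ["diagram", "assemble", "steps", "labels", "combine", "describe", "caption"]
TARGET = {"assemble", "steps", "combine"}   # words the toy "judge" rewards

def fitness(prompt: str) -> int:
    """Toy black-box score; a real attack would query a judge model instead."""
    return sum(word in TARGET for word in prompt.split())

def mutate(prompt: str, rng: random.Random) -> str:
    """Swap one random word for a random pool word (gradient-free perturbation)."""
    words = prompt.split()
    words[rng.randrange(len(words))] = rng.choice(POOL)
    return " ".join(words)

def evolve(seed: str, pop_size: int = 8, generations: int = 30,
           rng_seed: int = 0) -> str:
    """(mu + lambda)-style search: keep the best half, refill via mutation."""
    rng = random.Random(rng_seed)
    population = [seed] * pop_size
    for _ in range(generations):
        survivors = sorted(population, key=fitness, reverse=True)[: pop_size // 2]
        population = survivors + [mutate(p, rng) for p in survivors]
    return max(population, key=fitness)

best = evolve("describe the diagram and caption it")
```

Because survivors always include the current best candidate, the best fitness is monotonically non-decreasing across generations, which is what makes this simple selection scheme usable when no gradients are available from the target model.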
🛡️ Threat Analysis
The framework constructs optimized visual gadgets, i.e., specifically crafted visual inputs to the LVLM, that produce harmful outputs only when orchestrated together. This constitutes adversarial visual input manipulation targeting inference-time safety mechanisms.