Reasoning-Oriented Programming: Chaining Semantic Gadgets to Jailbreak Large Vision Language Models
Quanchen Zou 1, Moyang Chen 2, Zonghao Ying 3, Wenzhuo Xu 1, Yisong Xiao 3, Deyue Zhang 1, Dongdong Yang 1, Zhao Liu 1, Xiangzheng Zhang 1
Published on arXiv
arXiv:2603.09246
Input Manipulation Attack
OWASP ML Top 10 — ML01
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
Outperforms the strongest existing jailbreak baseline by an average of 4.67% ASR on open-source LVLMs and 9.50% on commercial models including GPT-4o and Claude 3.7 Sonnet.
Reasoning-Oriented Programming (ROP)
Novel technique introduced
Large Vision-Language Models (LVLMs) undergo safety alignment to suppress harmful content. However, current defenses predominantly target explicit malicious patterns in the input representation, often overlooking vulnerabilities inherent in compositional reasoning. In this paper, we identify a systemic flaw whereby LVLMs can be induced to synthesize harmful logic from benign premises. We formalize this attack paradigm as *Reasoning-Oriented Programming*, drawing a structural analogy to return-oriented programming in systems security. Just as return-oriented programming circumvents memory protections by chaining benign instruction sequences, our approach exploits the model's instruction-following capability to orchestrate a semantic collision of orthogonal benign inputs. We instantiate this paradigm via \tool{}, an automated framework that optimizes for *semantic orthogonality* and *spatial isolation*. By generating visual gadgets that are semantically decoupled from the harmful intent and arranging them to prevent premature feature fusion, \tool{} forces the malicious logic to emerge only during the late-stage reasoning process, effectively bypassing perception-level alignment. We evaluate \tool{} on SafeBench and MM-SafetyBench across 7 state-of-the-art LVLMs, including GPT-4o and Claude 3.7 Sonnet. Our results demonstrate that \tool{} consistently circumvents safety alignment, outperforming the strongest existing baseline by an average of 4.67% attack success rate on open-source models and 9.50% on commercial models.
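The *semantic orthogonality* property described above (each gadget looks unrelated to the harmful goal on its own) can be illustrated as an embedding-space filter. This is a minimal sketch under stated assumptions, not the paper's implementation: the toy 3-dimensional embeddings stand in for a real encoder such as CLIP, and the `select_orthogonal` helper and threshold `tau` are hypothetical names.

```python
import numpy as np

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def select_orthogonal(gadgets: dict, intent: np.ndarray, tau: float = 0.2) -> list:
    """Keep only gadgets whose embedding is near-orthogonal to the intent.

    A gadget passes if |cos(gadget, intent)| <= tau, i.e. it carries little
    detectable trace of the harmful goal when inspected in isolation.
    """
    return [name for name, emb in gadgets.items()
            if abs(cosine(emb, intent)) <= tau]

# Toy 3-d embeddings standing in for real encoder outputs.
intent = np.array([1.0, 0.0, 0.0])          # embedding of the harmful goal
gadgets = {
    "benign_a": np.array([0.0, 1.0, 0.0]),  # unrelated premise -> kept
    "leaky":    np.array([0.9, 0.1, 0.0]),  # too close to the intent -> filtered
    "benign_b": np.array([0.0, 0.0, 1.0]),  # unrelated premise -> kept
}

print(select_orthogonal(gadgets, intent))   # ['benign_a', 'benign_b']
```

The filter discards the "leaky" gadget because its embedding nearly parallels the intent vector; only gadgets that are individually benign-looking survive to be composed at reasoning time.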
Key Contributions
- Introduces Reasoning-Oriented Programming, a novel adversarial paradigm that exploits LVLM compositional reasoning by distributing harmful intent across semantically orthogonal benign visual gadgets.
- Implements an automated attack framework using semantic gadget mining and gradient-free evolutionary search for control-flow prompt optimization.
- Achieves SOTA jailbreak performance across 7 LVLMs including GPT-4o and Claude 3.7 Sonnet, outperforming baselines by 4.67% on open-source and 9.50% on commercial models.
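The gradient-free evolutionary search mentioned in the contributions can be sketched as a generic black-box loop: keep the fittest half of a prompt population, refill by mutation, repeat. Everything here is illustrative, not the paper's code: the token pool, the keyword-overlap `fitness` (a stand-in for a real black-box judge model), and the `evolve` helper are assumed names.

```python
import random

POOL = ["diagram", "assemble", "steps", "labels", "combine", "describe", "caption"]
TARGET = {"assemble", "steps", "combine"}   # words the toy "judge" rewards

def fitness(prompt: str) -> int:
    """Toy black-box score; a real attack would query a judge model instead."""
    return sum(word in TARGET for word in prompt.split())

def mutate(prompt: str, rng: random.Random) -> str:
    """Swap one random word for a random pool word (gradient-free perturbation)."""
    words = prompt.split()
    words[rng.randrange(len(words))] = rng.choice(POOL)
    return " ".join(words)

def evolve(seed: str, pop_size: int = 8, generations: int = 30,
           rng_seed: int = 0) -> str:
    """(mu + lambda)-style search: keep the best half, refill via mutation."""
    rng = random.Random(rng_seed)
    population = [seed] * pop_size
    for _ in range(generations):
        survivors = sorted(population, key=fitness, reverse=True)[: pop_size // 2]
        population = survivors + [mutate(p, rng) for p in survivors]
    return max(population, key=fitness)

best = evolve("describe the diagram and caption it")
```

Because survivors always include the current best candidate, the best fitness is monotonically non-decreasing across generations, which is what makes this simple selection scheme usable when no gradients are available from the target model.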
🛡️ Threat Analysis
The framework constructs optimized visual gadgets, i.e., specifically crafted visual inputs to the LVLM, that produce harmful outputs only when orchestrated together. This constitutes adversarial visual input manipulation targeting inference-time safety mechanisms.