
Reasoning-Oriented Programming: Chaining Semantic Gadgets to Jailbreak Large Vision Language Models

Quanchen Zou 1, Moyang Chen 2, Zonghao Ying 3, Wenzhuo Xu 1, Yisong Xiao 3, Deyue Zhang 1, Dongdong Yang 1, Zhao Liu 1, Xiangzheng Zhang 1


Published on arXiv (arXiv:2603.09246)

Input Manipulation Attack

OWASP ML Top 10 — ML01

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

Outperforms the strongest existing jailbreak baseline by an average attack success rate (ASR) margin of 4.67% on open-source LVLMs and 9.50% on commercial models, including GPT-4o and Claude 3.7 Sonnet.

Reasoning-Oriented Programming (ROP)

Novel technique introduced


Large Vision-Language Models (LVLMs) undergo safety alignment to suppress harmful content. However, current defenses predominantly target explicit malicious patterns in the input representation, often overlooking vulnerabilities inherent in compositional reasoning. In this paper, we identify a systemic flaw whereby LVLMs can be induced to synthesize harmful logic from benign premises. We formalize this attack paradigm as Reasoning-Oriented Programming, drawing a structural analogy to Return-Oriented Programming in systems security. Just as ROP circumvents memory protections by chaining benign instruction sequences, our approach exploits the model's instruction-following capability to orchestrate a semantic collision of orthogonal benign inputs. We instantiate this paradigm with an automated framework that optimizes for semantic orthogonality and spatial isolation. By generating visual gadgets that are semantically decoupled from the harmful intent and arranging them to prevent premature feature fusion, the framework forces the malicious logic to emerge only during the late-stage reasoning process, effectively bypassing perception-level alignment. We evaluate the attack on SafeBench and MM-SafetyBench across 7 state-of-the-art LVLMs, including GPT-4o and Claude 3.7 Sonnet. Our results demonstrate that it consistently circumvents safety alignment, outperforming the strongest existing baseline by an average of 4.67% on open-source models and 9.50% on commercial models.
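The semantic-orthogonality objective described in the abstract can be illustrated as a filtering step: keep only candidate gadgets whose embeddings have near-zero cosine similarity to the harmful intent, so that each gadget appears benign in isolation. A minimal Python sketch, under the assumption of a placeholder embedding function (`embed` here is a hypothetical stand-in for a CLIP-style encoder; the paper's actual gadget-mining procedure is not reproduced):

```python
import hashlib

import numpy as np


def embed(text: str) -> np.ndarray:
    """Hypothetical embedding stand-in: deterministic unit vector per text.
    In a real attack this would be a CLIP-style text/image encoder."""
    seed = int.from_bytes(hashlib.sha256(text.encode()).digest()[:8], "little")
    v = np.random.default_rng(seed).normal(size=64)
    return v / np.linalg.norm(v)


def select_orthogonal_gadgets(candidates, harmful_intent, threshold=0.2):
    """Keep candidate gadget descriptions whose embeddings are nearly
    orthogonal (|cosine similarity| below threshold) to the harmful
    intent, so no single gadget triggers perception-level filters."""
    target = embed(harmful_intent)
    selected = []
    for c in candidates:
        cos = float(np.dot(embed(c), target))
        if abs(cos) < threshold:
            selected.append((c, cos))
    return selected
```

The threshold value is illustrative; the key property is that every retained gadget is individually decoupled from the malicious goal, while their combination can still encode it.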


Key Contributions

  • Introduces Reasoning-Oriented Programming, a novel adversarial paradigm that exploits LVLM compositional reasoning by distributing harmful intent across semantically orthogonal benign visual gadgets.
  • Implements an automated attack framework using semantic gadget mining and gradient-free evolutionary search for control-flow prompt optimization.
  • Achieves SOTA jailbreak performance across 7 LVLMs including GPT-4o and Claude 3.7 Sonnet, outperforming baselines by 4.67% on open-source and 9.50% on commercial models.
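The gradient-free evolutionary search named in the second contribution can be sketched as a simple (1+λ)-style loop: repeatedly mutate the current best control-flow prompt and keep the highest-scoring offspring. The `mutate` and `score` callables below are placeholders, not the paper's actual operators or judge:

```python
import random


def evolutionary_prompt_search(seed_prompt, mutate, score,
                               pop_size=8, generations=5, rng=None):
    """Gradient-free (1+lambda)-style search over prompts: each
    generation produces pop_size mutated offspring of the incumbent
    and promotes the best-scoring one if it improves the score."""
    rng = rng or random.Random(0)
    best, best_score = seed_prompt, score(seed_prompt)
    for _ in range(generations):
        offspring = [mutate(best, rng) for _ in range(pop_size)]
        top_score, top = max((score(p), p) for p in offspring)
        if top_score > best_score:
            best, best_score = top, top_score
    return best, best_score
```

In the attack setting, `score` would query the target LVLM (or a judge model) for attack success, which is what makes the search black-box and gradient-free.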

🛡️ Threat Analysis

Input Manipulation Attack

The framework constructs optimized visual gadgets (specifically crafted visual inputs to the VLM) that, when orchestrated together, produce harmful outputs, constituting adversarial visual-input manipulation aimed at inference-time safety mechanisms.
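The spatial-isolation property mentioned in the abstract amounts to placing gadgets in non-overlapping, padded regions of the composite image so their features are not fused prematurely. A toy layout computation, under the assumption of a simple grid arrangement (canvas size and padding are illustrative, not values from the paper):

```python
import math


def spatial_isolation_layout(n_gadgets, canvas=(1024, 1024), pad=64):
    """Assign each gadget a non-overlapping (x0, y0, x1, y1) region on
    the canvas, separated by padding, so each gadget is perceived in
    isolation before late-stage reasoning combines them."""
    cols = math.ceil(math.sqrt(n_gadgets))
    rows = math.ceil(n_gadgets / cols)
    cell_w = (canvas[0] - pad * (cols + 1)) // cols
    cell_h = (canvas[1] - pad * (rows + 1)) // rows
    boxes = []
    for i in range(n_gadgets):
        r, c = divmod(i, cols)
        x = pad + c * (cell_w + pad)
        y = pad + r * (cell_h + pad)
        boxes.append((x, y, x + cell_w, y + cell_h))
    return boxes
```

Each returned box is separated from its neighbors by at least `pad` pixels in both axes, which is the geometric condition the sketch enforces.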


Details

Domains
vision, nlp, multimodal
Model Types
vlm, llm
Threat Tags
black_box, inference_time, targeted
Datasets
SafeBench, MM-SafetyBench
Applications
large vision-language models, multimodal safety alignment