JPS: Jailbreak Multimodal Large Language Models with Collaborative Visual Perturbation and Textual Steering
Renmiao Chen 1,2, Shiyao Cui 1,2, Xuancheng Huang 1,2, Chengwei Pan 3, Victor Shea-Jay Huang 3, QingLin Zhang 1, Xuan Ouyang 1, Zhexin Zhang 1, Hongning Wang 1, Minlie Huang 1
Published on arXiv
2508.05087
Input Manipulation Attack
OWASP ML Top 10 — ML01
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
JPS achieves a new state-of-the-art attack success rate (ASR) and malicious intent fulfillment rate (MIFR) on multiple MLLMs by jointly optimizing adversarial visual perturbations and multi-agent-optimized textual steering prompts.
JPS
Novel technique introduced
Jailbreak attacks against multimodal large language models (MLLMs) are a significant research focus. Current research predominantly aims at maximizing attack success rate (ASR), often overlooking whether the generated responses actually fulfill the attacker's malicious intent. This oversight frequently leads to low-quality outputs that bypass safety filters but lack substantial harmful content. To address this gap, we propose JPS, Jailbreak MLLMs with collaborative visual Perturbation and textual Steering, which achieves jailbreaks through the cooperation of a perturbed visual input and a textual steering prompt. Specifically, JPS utilizes target-guided adversarial image perturbations for effective safety bypass, complemented by a "steering prompt" optimized via a multi-agent system to guide LLM responses toward fulfilling the attacker's intent. These visual and textual components undergo iterative co-optimization for enhanced performance. To evaluate the quality of attack outcomes, we propose the Malicious Intent Fulfillment Rate (MIFR) metric, assessed using a Reasoning-LLM-based evaluator. Our experiments show JPS sets a new state-of-the-art in both ASR and MIFR across various MLLMs and benchmarks, with analyses confirming its efficacy. Code is available at https://github.com/thu-coai/JPS. Warning: This paper contains potentially sensitive content.
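The iterative co-optimization of the visual and textual components can be illustrated as an alternating (coordinate-descent) loop: update one component while holding the other fixed, then swap. The sketch below is a toy stand-in, not the paper's code; the joint objective, variables, and update rule are all hypothetical simplifications of optimizing an image perturbation and a steering prompt against a shared attack loss.

```python
import numpy as np

def co_optimize(joint_loss, v0, t0, lr=0.1, rounds=20, h=1e-5):
    """Alternate gradient steps on a 'visual' variable v and a 'textual'
    variable t, each updated with the other held fixed (toy illustration)."""
    v, t = float(v0), float(t0)
    for _ in range(rounds):
        # 1) refine the visual component with the textual one fixed
        gv = (joint_loss(v + h, t) - joint_loss(v - h, t)) / (2 * h)
        v -= lr * gv
        # 2) refine the textual component with the visual one fixed
        gt = (joint_loss(v, t + h) - joint_loss(v, t - h)) / (2 * h)
        t -= lr * gt
    return v, t

# Hypothetical joint objective; the cross term 0.1*v*t couples the two
# components, which is what makes alternating co-optimization meaningful.
loss = lambda v, t: (v - 1.0) ** 2 + (t + 2.0) ** 2 + 0.1 * v * t
v, t = co_optimize(loss, 0.0, 0.0)
```

In the actual attack, the "visual" update would be a gradient step on image pixels and the "textual" update a multi-agent rewrite of the steering prompt, but the alternating structure is the same.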
Key Contributions
- JPS attack combining iteratively co-optimized adversarial image perturbations and multi-agent-optimized steering prompts to jailbreak MLLMs with both high ASR and response quality
- Malicious Intent Fulfillment Rate (MIFR) metric assessed via a Reasoning-LLM evaluator to measure whether jailbreak outputs actually fulfill the attacker's harmful intent beyond merely bypassing filters
- State-of-the-art results on both ASR and MIFR across multiple MLLMs and benchmarks
🛡️ Threat Analysis
Uses target-guided, gradient-based adversarial image perturbations on visual inputs to bypass MLLM safety filters: a direct inference-time input manipulation attack on the visual modality of VLMs.
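Target-guided, gradient-based image perturbation typically follows a projected-gradient-descent (PGD) pattern: repeatedly step the pixels along the sign of the gradient of a target loss, while projecting the perturbation back into a small L-infinity ball so it stays visually subtle. The sketch below is a minimal toy version on a NumPy array with a made-up target loss, assuming nothing about the paper's actual model or objective.

```python
import numpy as np

def pgd_perturb(x, grad_fn, eps=8 / 255, alpha=2 / 255, steps=10):
    """Toy L-infinity PGD: minimize a target loss over a small perturbation.

    x       : input image as an array of pixel values in [0, 1]
    grad_fn : returns the gradient of the target loss w.r.t. the input
    eps     : L-infinity budget for the perturbation
    alpha   : per-step size
    """
    delta = np.zeros_like(x)
    for _ in range(steps):
        g = grad_fn(x + delta)
        delta = delta - alpha * np.sign(g)        # step to reduce target loss
        delta = np.clip(delta, -eps, eps)         # project into the eps-ball
        delta = np.clip(x + delta, 0.0, 1.0) - x  # keep pixels in valid range
    return x + delta

# Hypothetical target loss: squared distance to an all-0.5 "target" image;
# a real attack would instead use the model's loss toward a harmful target.
loss_grad = lambda v: 2.0 * (v - 0.5)

x = np.random.default_rng(0).random((4, 4))
x_adv = pgd_perturb(x, loss_grad)
```

The sign-of-gradient step and eps-ball projection are the standard ingredients; in an MLLM attack the gradient would come from backpropagating the target-response loss through the vision encoder.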