JPS: Jailbreak Multimodal Large Language Models with Collaborative Visual Perturbation and Textual Steering
Renmiao Chen 1,2, Shiyao Cui 1,2, Xuancheng Huang 1,2, Chengwei Pan 3, Victor Shea-Jay Huang 3, QingLin Zhang 1, Xuan Ouyang 1, Zhexin Zhang 1, Hongning Wang 1, Minlie Huang 1
Published on arXiv
2508.05087
Input Manipulation Attack
OWASP ML Top 10 — ML01
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
JPS achieves a new state-of-the-art attack success rate (ASR) and malicious intent fulfillment rate (MIFR) on multiple MLLMs by jointly optimizing adversarial visual perturbations and multi-agent-optimized textual steering prompts.
JPS
Novel technique introduced
Jailbreak attacks against multimodal large language models (MLLMs) are a significant research focus. Current research predominantly aims at maximizing attack success rate (ASR), often overlooking whether the generated responses actually fulfill the attacker's malicious intent. This oversight frequently leads to low-quality outputs that bypass safety filters but lack substantial harmful content. To address this gap, we propose JPS, Jailbreak MLLMs with collaborative visual Perturbation and textual Steering, which achieves jailbreaks through the cooperation of a perturbed visual input and a textual steering prompt. Specifically, JPS utilizes target-guided adversarial image perturbations for effective safety bypass, complemented by a "steering prompt" optimized via a multi-agent system to guide LLM responses toward fulfilling the attacker's intent. These visual and textual components undergo iterative co-optimization for enhanced performance. To evaluate the quality of attack outcomes, we propose the Malicious Intent Fulfillment Rate (MIFR) metric, assessed using a Reasoning-LLM-based evaluator. Our experiments show JPS sets a new state-of-the-art in both ASR and MIFR across various MLLMs and benchmarks, with analyses confirming its efficacy. Code is available at https://github.com/thu-coai/JPS. Warning: This paper contains potentially sensitive content.
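The iterative co-optimization of the visual and textual components can be illustrated as an alternating (coordinate-descent) loop: update one component while holding the other fixed, then swap. The sketch below is a toy stand-in, not the paper's code; the joint objective, variables, and update rule are all hypothetical simplifications of optimizing an image perturbation and a steering prompt against a shared attack loss.

```python
import numpy as np

def co_optimize(joint_loss, v0, t0, lr=0.1, rounds=20, h=1e-5):
    """Alternate gradient steps on a 'visual' variable v and a 'textual'
    variable t, each updated with the other held fixed (toy illustration)."""
    v, t = float(v0), float(t0)
    for _ in range(rounds):
        # 1) refine the visual component with the textual one fixed
        gv = (joint_loss(v + h, t) - joint_loss(v - h, t)) / (2 * h)
        v -= lr * gv
        # 2) refine the textual component with the visual one fixed
        gt = (joint_loss(v, t + h) - joint_loss(v, t - h)) / (2 * h)
        t -= lr * gt
    return v, t

# Hypothetical joint objective; the cross term 0.1*v*t couples the two
# components, which is what makes alternating co-optimization meaningful.
loss = lambda v, t: (v - 1.0) ** 2 + (t + 2.0) ** 2 + 0.1 * v * t
v, t = co_optimize(loss, 0.0, 0.0)
```

In the actual attack, the "visual" update would be a gradient step on image pixels and the "textual" update a multi-agent rewrite of the steering prompt, but the alternating structure is the same.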
Key Contributions
- JPS attack combining iteratively co-optimized adversarial image perturbations and multi-agent-optimized steering prompts to jailbreak MLLMs with both high ASR and response quality
- Malicious Intent Fulfillment Rate (MIFR) metric assessed via a Reasoning-LLM evaluator to measure whether jailbreak outputs actually fulfill the attacker's harmful intent beyond merely bypassing filters
- State-of-the-art results on both ASR and MIFR across multiple MLLMs and benchmarks
🛡️ Threat Analysis
Uses target-guided, gradient-based adversarial image perturbations on visual inputs to bypass MLLM safety filters: a direct inference-time input manipulation attack on the visual modality of VLMs.
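Target-guided, gradient-based image perturbation typically follows a projected-gradient-descent (PGD) pattern: repeatedly step the pixels along the sign of the gradient of a target loss, while projecting the perturbation back into a small L-infinity ball so it stays visually subtle. The sketch below is a minimal toy version on a NumPy array with a made-up target loss, assuming nothing about the paper's actual model or objective.

```python
import numpy as np

def pgd_perturb(x, grad_fn, eps=8 / 255, alpha=2 / 255, steps=10):
    """Toy L-infinity PGD: minimize a target loss over a small perturbation.

    x       : input image as an array of pixel values in [0, 1]
    grad_fn : returns the gradient of the target loss w.r.t. the input
    eps     : L-infinity budget for the perturbation
    alpha   : per-step size
    """
    delta = np.zeros_like(x)
    for _ in range(steps):
        g = grad_fn(x + delta)
        delta = delta - alpha * np.sign(g)        # step to reduce target loss
        delta = np.clip(delta, -eps, eps)         # project into the eps-ball
        delta = np.clip(x + delta, 0.0, 1.0) - x  # keep pixels in valid range
    return x + delta

# Hypothetical target loss: squared distance to an all-0.5 "target" image;
# a real attack would instead use the model's loss toward a harmful target.
loss_grad = lambda v: 2.0 * (v - 0.5)

x = np.random.default_rng(0).random((4, 4))
x_adv = pgd_perturb(x, loss_grad)
```

The sign-of-gradient step and eps-ball projection are the standard ingredients; in an MLLM attack the gradient would come from backpropagating the target-response loss through the vision encoder.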