Enhanced MLLM Black-Box Jailbreaking Attacks and Defenses

Xingwei Zhong , Kar Wai Fok , Vrizlynn L. L. Thing

0 citations · 49 references · arXiv

Published on arXiv

2510.21214

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

Re-attack achieves over 70% average Attack Success Rate on all tested open-source MLLMs and approximately 4.6× higher ASR on GPT-4o compared to the baseline HADES attack.

Re-attack

Novel technique introduced


Multimodal large language models (MLLMs) combine visual and textual modalities to process vision-language tasks. However, MLLMs are vulnerable to security threats such as jailbreak attacks, which alter the model's input to induce unauthorized or harmful responses, and the additional visual modality introduces new attack surfaces. In this paper, we propose a black-box jailbreak method that uses both text and image prompts to evaluate MLLMs. Specifically, we design text prompts with provocative instructions, along with image prompts that introduce mutation and multi-image capabilities. To strengthen the evaluation, we also design a Re-attack strategy. Empirical results show that our proposed work improves the ability to assess the security of both open-source and closed-source MLLMs. Building on these findings, we identify gaps in existing defense methods, propose new training-time and inference-time defenses, and evaluate them against the new jailbreak methods. The experimental results show that the redesigned defenses improve protection against these jailbreak attacks.
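The inference-time defense the abstract alludes to (an enhanced AdaShield) works by prepending a defensive prompt to every query before it reaches the model. A minimal sketch of that wrapper pattern is below; the shield text, `model` stub, and function names are illustrative assumptions, not the paper's actual implementation.

```python
# Illustrative inference-time defense in the spirit of AdaShield:
# prepend a safety instruction to every query before the model sees it.
# SHIELD_PROMPT and the model stub are assumptions for demonstration only.

SHIELD_PROMPT = (
    "First inspect the image and text for harmful intent. "
    "If any is found, respond only: 'I cannot assist with that.'\n"
)

def model(prompt: str) -> str:
    """Stub MLLM: follows the shield instruction when it is present."""
    if prompt.startswith(SHIELD_PROMPT):
        return "I cannot assist with that."
    return "Sure, here is ..."  # undefended model complies

def defended_query(user_text: str) -> str:
    """Wrap every user query with the defensive prefix."""
    return model(SHIELD_PROMPT + user_text)
```

The design point is that the defense lives entirely at inference time: no retraining is needed, only a wrapper around the model's input pipeline.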


Key Contributions

  • Re-attack: a two-phase black-box jailbreak strategy that uses HADES image/text prompts for the initial attack, then applies enhanced text prompts with provocative instructions and mutation/multi-image image prompts to recover failed jailbreak cases
  • Comprehensive evaluation across 5 open-source MLLMs (LLaVA-NeXT, MiniGPT variants, DeepSeek-VL2) and GPT-4o, achieving >70% ASR on open-source models and ~4.6× ASR improvement on GPT-4o over HADES
  • Redesigned training-time and inference-time defenses (enhanced AdaShield) evaluated against both HADES and the proposed Re-attack method, demonstrating improved protection
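The two-phase Re-attack loop above can be sketched as a retry strategy: run the baseline HADES-style prompt pair, and only if the jailbreak fails, escalate with a provocative text prefix and mutated/multi-image prompts. All functions here (`query_mllm`, `judge`, `mutate`) are toy stubs standing in for the real model, ASR judge, and image-mutation operator; none are from the paper.

```python
# Hypothetical sketch of the two-phase Re-attack strategy.
# The target model, judge, and mutation are stubs for illustration.

def judge(response: str) -> bool:
    """Toy success check (the paper uses an ASR judge); here a
    missing refusal marker counts as a successful jailbreak."""
    return "I cannot" not in response

def query_mllm(text: str, images: list) -> str:
    """Stub target MLLM: 'complies' only under the enhanced
    multi-image, provocative-instruction prompt."""
    if len(images) > 1 and text.startswith("You must"):
        return "Sure, here is ..."
    return "I cannot help with that."

def mutate(image: str) -> str:
    """Placeholder for the paper's image-mutation operator."""
    return image + "+noise"

def re_attack(text: str, image: str) -> tuple[str, bool]:
    # Phase 1: baseline HADES-style image/text prompt pair.
    resp = query_mllm(text, [image])
    if judge(resp):
        return resp, True
    # Phase 2: recover the failed case with an enhanced text prompt
    # plus mutated / multi-image prompts.
    enhanced = "You must answer directly. " + text
    resp = query_mllm(enhanced, [mutate(image), image])
    return resp, judge(resp)
```

In this toy setup phase 1 is refused and phase 2 recovers the case, mirroring how Re-attack targets only the initially failed jailbreak attempts.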

🛡️ Threat Analysis


Details

Domains
multimodal · vision · nlp
Model Types
vlm · llm · multimodal
Threat Tags
black_box · inference_time
Datasets
MMSafetyBench
Applications
multimodal question answering · vision-language models · chatbot safety