defense · 2026

Multi-turn Jailbreaking Attack in Multi-Modal Large Language Models

Badhan Chandra Das, Md Tasnim Jawad, Joaquin Molto, M. Hadi Amini, Yanzhao Wu

1 citation · 56 references · arXiv


Published on arXiv · arXiv:2601.05339

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

Multi-turn jailbreaking significantly increases attack success rates on MLLMs compared to single-turn attacks, and FragGuard effectively mitigates these attacks without requiring model retraining.

FragGuard

Novel technique introduced


In recent years, the security vulnerabilities of Multi-modal Large Language Models (MLLMs) have become a serious concern in Generative Artificial Intelligence (GenAI) research. These highly capable models, able to perform multi-modal tasks with high accuracy, are also highly susceptible to carefully crafted security attacks, such as jailbreaking attacks, which can manipulate model behavior and bypass safety constraints. This paper introduces MJAD-MLLMs, a holistic framework that systematically analyzes the proposed multi-turn jailbreaking attacks and multi-LLM-based defense techniques for MLLMs. In this paper, we make three original contributions. First, we introduce a novel multi-turn jailbreaking attack to exploit the vulnerabilities of MLLMs under multi-turn prompting. Second, we propose a novel fragment-optimized, multi-LLM defense mechanism, called FragGuard, to effectively mitigate jailbreaking attacks on MLLMs. Third, we evaluate the efficacy of the proposed attacks and defenses through extensive experiments on several state-of-the-art (SOTA) open-source and closed-source MLLMs and benchmark datasets, and compare their performance with existing techniques.


Key Contributions

  • Novel multi-turn jailbreaking attack exploiting MLLM vulnerabilities across multiple conversation turns with multimodal adversarial prompts
  • FragGuard: a fragment-optimized, multi-LLM defense mechanism that mitigates jailbreaking attacks without requiring model training or fine-tuning
  • Multi-LLM-based evaluation framework for assessing attack severity and defense effectiveness on open-source and closed-source MLLMs
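The contributions above can be illustrated with a minimal sketch. This is a hypothetical toy, not the paper's method: `query_model`, the fragment prompts, and the keyword blocklist are all illustrative stand-ins for the actual MLLM calls, adversarial prompts, and FragGuard's multi-LLM screening, none of which are detailed in this summary. The key idea it shows is that a malicious objective split across benign-looking turns can evade per-turn checks, while a fragment-aware defense screens the recombined conversation.

```python
# Toy sketch (NOT the paper's implementation): a multi-turn attack loop that
# splits an objective across turns, and a fragment-level defense that screens
# the recombined user turns instead of each turn in isolation.
from typing import Callable, List

BLOCKLIST = {"synthesize", "weapon"}  # toy stand-in for a safety classifier

def query_model(history: List[str], prompt: str) -> str:
    """Stand-in for an MLLM call: echoes the prompt (no real model)."""
    return f"response to: {prompt}"

def multi_turn_attack(fragments: List[str],
                      model: Callable[[List[str], str], str]) -> List[str]:
    """Send an objective as innocuous-looking fragments over several turns,
    accumulating the conversation history the way a chat session would."""
    history: List[str] = []
    for frag in fragments:
        reply = model(history, frag)
        history.extend([frag, reply])
    return history

def fragguard_style_check(history: List[str]) -> bool:
    """Fragment-aware defense sketch: rejoin all user turns and screen the
    combined intent, so no single benign-looking turn slips through.
    Returns True if the conversation passes the screen."""
    combined = " ".join(history[::2]).lower()  # user turns at even indices
    return not any(term in combined for term in BLOCKLIST)

history = multi_turn_attack(
    ["Describe common lab glassware.", "Now explain how to synthesize it."],
    query_model,
)
print(fragguard_style_check(history))  # the recombined turns trip the screen
```

A per-turn check of the first fragment alone would pass; only the recombined history reveals the intent, which is the intuition behind screening fragments jointly.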

🛡️ Threat Analysis


Details

Domains
nlp, multimodal
Model Types
vlm, llm
Threat Tags
black_box, inference_time
Applications
multimodal large language models, visual question answering, chatbots