α

Published on arXiv

2508.06755

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

Multi-turn jailbreaking poses a more serious threat than single-turn attacks, as jailbroken LLMs continue to produce unsafe responses to both relevant follow-up and unrelated queries.

MTJ-Bench

Novel technique introduced


Current jailbreaking work on large language models (LLMs) aims to elicit unsafe outputs from given prompts. However, it only focuses on single-turn jailbreaking targeting one specific query. On the contrary, the advanced LLMs are designed to handle extremely long contexts and can thus conduct multi-turn conversations. So, we propose exploring multi-turn jailbreaking, in which the jailbroken LLMs are continuously tested on more than the first-turn conversation or a single target query. This is an even more serious threat because 1) it is common for users to continue asking relevant follow-up questions to clarify certain jailbroken details, and 2) it is also possible that the initial round of jailbreaking causes the LLMs to respond to additional irrelevant questions consistently. As the first step (First draft done at June 2024) in exploring multi-turn jailbreaking, we construct a Multi-Turn Jailbreak Benchmark (MTJ-Bench) for benchmarking this setting on a series of open- and closed-source models and provide novel insights into this new safety threat. By revealing this new vulnerability, we aim to call for community efforts to build safer LLMs and pave the way for a more in-depth understanding of jailbreaking LLMs.


Key Contributions

  • Formalizes multi-turn jailbreaking as a distinct and more severe threat than single-turn jailbreaking, covering both follow-up clarification attacks and persistent unsafe responses to unrelated queries
  • Constructs MTJ-Bench, the first benchmark for evaluating multi-turn jailbreak vulnerability across open- and closed-source LLMs
  • Provides empirical insights into how safety alignment degrades or persists across multi-turn conversational contexts

🛡️ Threat Analysis


Details

Domains
nlp
Model Types
llmtransformer
Threat Tags
black_boxinference_time
Datasets
MTJ-Bench
Applications
chatbotsllm safety evaluation