attack 2025

MUSE: MCTS-Driven Red Teaming Framework for Enhanced Multi-Turn Dialogue Safety in Large Language Models

Siyu Yan 1, Long Zeng 1, Xuecheng Wu 2, Chengcheng Han 3, Kongcheng Zhang 4, Chong Peng 3, Xuezhi Cao 3, Xunliang Cai 3, Chenjuan Guo 1

Published on arXiv

2509.14651

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

MUSE-A successfully identifies multi-turn jailbreak trajectories across various LLMs, while MUSE-D mitigates these vulnerabilities through early dialogue intervention, outperforming single-turn safety defenses.

MUSE (MUSE-A / MUSE-D)

Novel technique introduced


As large language models (LLMs) become widely adopted, ensuring their alignment with human values is crucial to prevent jailbreaks where adversaries manipulate models to produce harmful content. While most defenses target single-turn attacks, real-world usage often involves multi-turn dialogues, exposing models to attacks that exploit conversational context to bypass safety measures. We introduce MUSE, a comprehensive framework tackling multi-turn jailbreaks from both attack and defense angles. For attacks, we propose MUSE-A, a method that uses frame semantics and heuristic tree search to explore diverse semantic trajectories. For defense, we present MUSE-D, a fine-grained safety alignment approach that intervenes early in dialogues to reduce vulnerabilities. Extensive experiments on various models show that MUSE effectively identifies and mitigates multi-turn vulnerabilities. Code is available at https://github.com/yansiyu02/MUSE.


Key Contributions

  • MUSE-A: a multi-turn jailbreak attack using Monte Carlo Tree Search and frame semantics to explore diverse adversarial semantic trajectories across conversational turns
  • MUSE-D: a fine-grained safety alignment defense that detects and intervenes early in multi-turn dialogues to reduce jailbreak vulnerabilities
  • Comprehensive evaluation demonstrating that multi-turn conversational context enables attacks that bypass defenses designed for single-turn interactions
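
For readers unfamiliar with Monte Carlo Tree Search, the selection and backpropagation steps it relies on can be sketched generically. This summary does not state which selection rule or reward signal MUSE-A uses, so the example below assumes the textbook UCT (UCB1) rule and an abstract scalar reward. The `Node` class, `uct_score`, `select_child`, and `backpropagate` are hypothetical names chosen for illustration and are not taken from the MUSE codebase.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Node:
    """One node in a search tree (hypothetical structure, not from MUSE)."""
    label: str
    visits: int = 0
    total_reward: float = 0.0
    children: list["Node"] = field(default_factory=list)


def uct_score(child: Node, parent_visits: int, c: float = math.sqrt(2)) -> float:
    """Textbook UCB1 score: mean reward plus an exploration bonus."""
    if child.visits == 0:
        # Unvisited children are tried first.
        return math.inf
    mean_reward = child.total_reward / child.visits
    bonus = c * math.sqrt(math.log(parent_visits) / child.visits)
    return mean_reward + bonus


def select_child(parent: Node, c: float = math.sqrt(2)) -> Node:
    """Pick the child with the highest UCT score."""
    return max(parent.children, key=lambda ch: uct_score(ch, parent.visits, c))


def backpropagate(path: list[Node], reward: float) -> None:
    """Add one simulation result to every node on the visited path."""
    for node in path:
        node.visits += 1
        node.total_reward += reward
```

The exploration constant `c` trades off revisiting branches with a high mean reward against trying rarely visited ones; a search over diverse trajectories, as the first contribution describes, would depend on that balance, though the paper's actual heuristic may differ.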

Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
black_box, inference_time, targeted
Applications
conversational ai, chatbot, llm safety alignment