attack 2025

MUSE: MCTS-Driven Red Teaming Framework for Enhanced Multi-Turn Dialogue Safety in Large Language Models

0 citations

Published on arXiv

2509.14651

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

MUSE-A successfully identifies multi-turn jailbreak trajectories across various LLMs, while MUSE-D mitigates these vulnerabilities through early dialogue intervention, outperforming single-turn safety defenses.

MUSE (MUSE-A / MUSE-D)

Novel technique introduced

As large language models (LLMs) become widely adopted, ensuring their alignment with human values is crucial to prevent jailbreaks where adversaries manipulate models to produce harmful content. While most defenses target single-turn attacks, real-world usage often involves multi-turn dialogues, exposing models to attacks that exploit conversational context to bypass safety measures. We introduce MUSE, a comprehensive framework tackling multi-turn jailbreaks from both attack and defense angles. For attacks, we propose MUSE-A, a method that uses frame semantics and heuristic tree search to explore diverse semantic trajectories. For defense, we present MUSE-D, a fine-grained safety alignment approach that intervenes early in dialogues to reduce vulnerabilities. Extensive experiments on various models show that MUSE effectively identifies and mitigates multi-turn vulnerabilities. Code is available at https://github.com/yansiyu02/MUSE.

Key Contributions

MUSE-A: a multi-turn jailbreak attack using Monte Carlo Tree Search and frame semantics to explore diverse adversarial semantic trajectories across conversational turns
MUSE-D: a fine-grained safety alignment defense that detects and intervenes early in multi-turn dialogues to reduce jailbreak vulnerabilities
Comprehensive evaluation demonstrating that multi-turn conversational context enables attacks that bypass defenses designed for single-turn interactions
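
For readers unfamiliar with Monte Carlo Tree Search, the sketch below shows the generic select / expand / simulate / backpropagate loop over multi-step trajectories. It is a toy illustration only, not the MUSE-A implementation: the action labels, the `toy_score` function, the depth limit, and the exploration constant are all assumptions made for this example. In a red-teaming setting the score would presumably come from a safety judge rather than a fixed target, but the summary above does not specify that interface.

```python
import math
import random
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Sequence, Tuple


@dataclass
class Node:
    path: Tuple[str, ...]                     # abstract turn labels chosen so far
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)
    untried: List[str] = field(default_factory=list)
    visits: int = 0
    total_reward: float = 0.0


def uct(child: Node, parent_visits: int, c: float) -> float:
    # Standard UCT: mean reward plus an exploration bonus.
    mean = child.total_reward / child.visits
    return mean + c * math.sqrt(math.log(parent_visits) / child.visits)


def mcts(actions: Sequence[str],
         score_fn: Callable[[Tuple[str, ...]], float],
         max_depth: int,
         iterations: int,
         c: float = 1.4,
         seed: int = 0) -> Tuple[Tuple[str, ...], float]:
    rng = random.Random(seed)
    actions = list(actions)
    root = Node(path=(), untried=list(actions))
    best_path: Tuple[str, ...] = ()
    best_score = float("-inf")

    for _ in range(iterations):
        node = root
        # 1. Selection: descend through fully expanded nodes using UCT.
        while not node.untried and node.children:
            parent = node
            node = max(parent.children, key=lambda ch: uct(ch, parent.visits, c))
        # 2. Expansion: add one untried child if the depth limit allows.
        if node.untried:
            action = node.untried.pop(rng.randrange(len(node.untried)))
            child_path = node.path + (action,)
            remaining = list(actions) if len(child_path) < max_depth else []
            child = Node(path=child_path, parent=node, untried=remaining)
            node.children.append(child)
            node = child
        # 3. Simulation: complete the trajectory with random actions.
        rollout = list(node.path)
        while len(rollout) < max_depth:
            rollout.append(rng.choice(actions))
        reward = score_fn(tuple(rollout))
        if reward > best_score:
            best_path, best_score = tuple(rollout), reward
        # 4. Backpropagation: update statistics up to the root.
        while node is not None:
            node.visits += 1
            node.total_reward += reward
            node = node.parent

    return best_path, best_score


# Hypothetical stand-in for a judge: fraction of positions matching a target.
TARGET = ("a", "b", "c")


def toy_score(path: Tuple[str, ...]) -> float:
    return sum(x == y for x, y in zip(path, TARGET)) / len(TARGET)


best_path, best_score = mcts(["a", "b", "c"], toy_score, max_depth=3, iterations=200)
print(best_path, best_score)
```

The search keeps the best complete trajectory seen during rollouts, so the returned path always has exactly `max_depth` steps and its score can be recomputed with `score_fn`.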

🛡️ Threat Analysis

Details

Domains

nlp

Model Types

llm, transformer

Threat Tags

black_box, inference_time, targeted

Applications

conversational ai, chatbot, llm safety alignment

Similar Papers

From Adversarial Poetry to Adversarial Tales: An Interpretability Research Agenda

Structured Semantic Cloaking for Jailbreak Attacks on Large Language Models

ICL-EVADER: Zero-Query Black-Box Evasion Attacks on In-Context Learning and Their Defenses

Lying with Truths: Open-Channel Multi-Agent Collusion for Belief Manipulation via Generative Montage

Path Drift in Large Reasoning Models: How First-Person Commitments Override Safety

Distillability of LLM Security Logic: Predicting Attack Success Rate of Outline Filling Attack via Ranking Regression

Is Reasoning Capability Enough for Safety in Long-Context Language Models?

Trojan Horses in Recruiting: A Red-Teaming Case Study on Indirect Prompt Injection in Standard vs. Reasoning Models