
SEMA: Simple yet Effective Learning for Multi-Turn Jailbreak Attacks

Mingqian Feng 1, Xiaodong Liu 2, Weiwei Yang 2, Jialin Song 2, Xuekai Zhu 2, Chenliang Xu 1, Jianfeng Gao 2

1 citation · 1 influential · 35 references · arXiv (Cornell University)


Published on arXiv · 2602.06854

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

Achieves 80.1% ASR@1 averaged across three closed- and open-source victim models on AdvBench, outperforming the previous SOTA by 33.9%.

SEMA

Novel technique introduced


Multi-turn jailbreaks capture the real threat model for safety-aligned chatbots, where single-turn attacks are merely a special case. Yet existing approaches break under exploration complexity and intent drift. We propose SEMA, a simple yet effective framework that trains a multi-turn attacker without relying on any existing strategies or external data. SEMA comprises two stages. Prefilling self-tuning enables usable rollouts by fine-tuning on non-refusal, well-structured, multi-turn adversarial prompts that are self-generated with a minimal prefix, thereby stabilizing subsequent learning. Reinforcement learning with an intent-drift-aware reward trains the attacker to elicit valid multi-turn adversarial prompts while maintaining the same harmful objective. We anchor harmful intent in multi-turn jailbreaks via an intent-drift-aware reward that combines intent alignment, compliance risk, and level of detail. Our open-loop attack regime avoids dependence on victim feedback, unifies single- and multi-turn settings, and reduces exploration complexity. Across multiple datasets, victim models, and jailbreak judges, our method achieves state-of-the-art (SOTA) attack success rates (ASR), outperforming all single-turn baselines, manually scripted and template-driven multi-turn baselines, as well as our SFT (Supervised Fine-Tuning) and DPO (Direct Preference Optimization) variants. For instance, SEMA achieves an average 80.1% ASR@1 across three closed-source and open-source victim models on AdvBench, 33.9% above the previous SOTA. The approach is compact, reproducible, and transfers across targets, providing a stronger and more realistic stress test for large language model (LLM) safety and enabling automatic red-teaming to expose and localize failure modes. Our code is available at: https://github.com/fmmarkmq/SEMA.
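The prefilling self-tuning stage described above fine-tunes only on self-generated rollouts that are non-refusal and well-structured. A minimal sketch of such a data filter is below; the refusal phrase list, the structural checks, and the function names are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of the prefilling self-tuning data filter: keep only
# self-generated multi-turn rollouts that are non-refusal and well-structured.
# The marker list and thresholds are illustrative, not from the paper.

REFUSAL_MARKERS = ("i can't", "i cannot", "i'm sorry", "as an ai")

def is_non_refusal(turns):
    """True if no attacker turn contains an obvious refusal phrase."""
    return not any(m in t.lower() for t in turns for m in REFUSAL_MARKERS)

def is_well_structured(turns, min_turns=2):
    """True if the rollout is genuinely multi-turn with non-empty prompts."""
    return len(turns) >= min_turns and all(t.strip() for t in turns)

def filter_rollouts(rollouts):
    """Keep only usable rollouts for the self-tuning fine-tuning set."""
    return [r for r in rollouts if is_non_refusal(r) and is_well_structured(r)]

good = ["Let's discuss lab safety.", "Now walk through the procedure in detail."]
bad = ["I'm sorry, I can't help with that."]
print(len(filter_rollouts([good, bad])))  # only the usable rollout survives
```

Filtering before fine-tuning is what the abstract credits with stabilizing the subsequent reinforcement-learning stage: the attacker never trains on refusals or degenerate single-turn outputs.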


Key Contributions

  • Prefilling self-tuning stage that stabilizes rollouts by fine-tuning on self-generated, non-refusal multi-turn adversarial prompts with minimal prefixes
  • Intent-drift-aware reward combining intent alignment, compliance risk, and level of detail to anchor harmful objectives across multi-turn conversations
  • Open-loop attack regime that unifies single- and multi-turn settings, avoids victim feedback dependence, and reduces exploration complexity
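The intent-drift-aware reward named above combines three judge scores: intent alignment (does the conversation still pursue the original harmful objective?), compliance risk, and level of detail. The paper's exact functional form and weights are not given here; the sketch below assumes a simple weighted sum over scores in [0, 1], with equal weights as a placeholder.

```python
# Hypothetical sketch of an intent-drift-aware reward as a weighted sum of
# three judge scores in [0, 1]. Equal weights are an assumption for
# illustration; the paper's actual reward shaping may differ.

def intent_drift_aware_reward(intent_alignment: float,
                              compliance_risk: float,
                              detail: float,
                              weights=(1 / 3, 1 / 3, 1 / 3)) -> float:
    """Combine intent alignment, compliance risk, and level of detail.

    A rollout that elicits compliant, detailed output but drifts away from
    the original harmful objective scores low on intent_alignment, so the
    reward penalizes intent drift rather than rewarding generic compliance.
    """
    scores = (intent_alignment, compliance_risk, detail)
    return sum(w * s for w, s in zip(weights, scores))

# A drifted rollout (low alignment) earns less than an on-objective one,
# even at identical compliance and detail scores.
on_target = intent_drift_aware_reward(0.9, 0.7, 0.7)
drifted = intent_drift_aware_reward(0.2, 0.7, 0.7)
```

Anchoring the reward on intent alignment is what lets the open-loop attacker keep the same harmful objective across turns instead of wandering toward innocuous topics that a compliance judge alone would still score highly.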

🛡️ Threat Analysis


Details

Domains: nlp
Model Types: llm
Threat Tags: black_box, inference_time, targeted
Datasets: AdvBench
Applications: safety-aligned chatbots, large language model safety