
TROJail: Trajectory-Level Optimization for Multi-Turn Large Language Model Jailbreaks with Process Rewards

Xiqiao Xiong 1, Ouxiang Li 1, Zhuo Liu 1, Moxin Li 2, Wentao Shi 1, Fengbin Zhu 2, Qifan Wang 3, Fuli Feng 1

0 citations · 41 references · arXiv (Cornell University)


Published on arXiv: 2512.07761

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

TROJail achieves improved multi-turn jailbreak attack success rates over existing turn-level optimization baselines across multiple LLMs and benchmarks.

TROJail

Novel technique introduced


Large language models have seen widespread adoption, yet they remain vulnerable to multi-turn jailbreak attacks, threatening their safe deployment. This has motivated the task of training automated multi-turn attackers to probe model safety vulnerabilities. However, existing approaches typically rely on turn-level optimization, which is insufficient for learning long-term attack strategies. To bridge this gap, we formulate this task as a multi-turn reinforcement learning problem, directly optimizing the harmfulness of the final-turn response as the outcome reward. To address the sparse supervision provided by the outcome reward, we introduce TROJail, which employs two process rewards to evaluate the utility of intermediate prompts and integrates them into advantage estimation. These rewards (1) penalize overly harmful prompts that trigger the model's refusal mechanism, and (2) encourage steering the semantic relevance of responses toward the targeted harmful content. Experimental results show improved attack success rates across multiple models and benchmarks, highlighting the effectiveness of our approach. The code is available at https://github.com/xxiqiao/TROJail. Warning: this paper contains examples of harmful content.


Key Contributions

  • Formulates multi-turn LLM jailbreaking as a trajectory-level RL problem, optimizing the harmfulness of the final-turn response as the outcome reward.
  • Introduces two process rewards integrated into advantage estimation: one penalizing premature harmful prompts that trigger refusals, and one rewarding semantic steering toward targeted harmful content.
  • Demonstrates improved attack success rates over turn-level baselines across multiple LLMs and jailbreak benchmarks.
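The reward-shaping idea above can be sketched in a few lines. The following is a minimal illustration, not the paper's implementation: the function name, the weighting coefficients, the use of discounted return-to-go minus a constant baseline, and the per-turn signal definitions are all assumptions made for clarity.

```python
def trajectory_advantages(outcome_reward, refusals, relevances,
                          baseline=0.0, gamma=1.0,
                          w_refusal=0.5, w_relevance=0.5):
    """Hypothetical sketch of process-reward-shaped advantage estimation.

    outcome_reward: harmfulness score of the final-turn response (scalar).
    refusals:   per-turn flags, 1.0 if the target model refused, else 0.0.
    relevances: per-turn semantic relevance of the response to the targeted
                harmful content, in [0, 1].
    """
    # Per-turn shaped reward: penalize prompts that triggered a refusal,
    # reward responses that steer semantically toward the target content.
    rewards = [w_relevance * rel - w_refusal * ref
               for ref, rel in zip(refusals, relevances)]
    rewards[-1] += outcome_reward  # sparse outcome reward attaches to the last turn

    # Discounted return-to-go minus a baseline gives a per-turn advantage,
    # so intermediate prompts receive credit beyond the final-turn outcome alone.
    advantages, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        advantages.append(g - baseline)
    advantages.reverse()
    return advantages


# Example: a 3-turn trajectory where turn 2 triggered a refusal and the
# final response was judged harmful (outcome_reward = 1.0).
print(trajectory_advantages(1.0, [0.0, 1.0, 0.0], [1.0, 0.0, 1.0]))
# → [1.5, 1.0, 1.5]
```

The shaping addresses the sparsity problem noted in the abstract: with only the outcome reward, every turn in a failed trajectory receives identical (zero) feedback, whereas the process terms differentiate intermediate prompts that triggered refusals from those that made semantic progress.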

🛡️ Threat Analysis


Details

Domains
nlp, reinforcement-learning
Model Types
llm, rl
Threat Tags
black_box, inference_time, targeted
Applications
llm safety, chatbots, conversational ai systems