
TROJail: Trajectory-Level Optimization for Multi-Turn Large Language Model Jailbreaks with Process Rewards

Xiqiao Xiong 1, Ouxiang Li 1, Zhuo Liu 1, Moxin Li 2, Wentao Shi 1, Fengbin Zhu 2, Qifan Wang 3, Fuli Feng 1

0 citations · 41 references · arXiv (Cornell University)


Published on arXiv: 2512.07761

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

TROJail achieves improved multi-turn jailbreak attack success rates over existing turn-level optimization baselines across multiple LLMs and benchmarks.

TROJail

Novel technique introduced


Large language models have seen widespread adoption, yet they remain vulnerable to multi-turn jailbreak attacks, threatening their safe deployment. This has motivated the task of training automated multi-turn attackers to probe model safety vulnerabilities. However, existing approaches typically rely on turn-level optimization, which is insufficient for learning long-term attack strategies. To bridge this gap, we formulate this task as a multi-turn reinforcement learning problem, directly optimizing the harmfulness of the final-turn response as the outcome reward. To address the sparse supervision provided by the outcome reward, we introduce TROJail, which employs two process rewards to evaluate the utility of intermediate prompts and integrates them into advantage estimation. These rewards (1) penalize overly harmful prompts that trigger the model's refusal mechanism, and (2) encourage steering the semantic relevance of responses toward the targeted harmful content. Experimental results show improved attack success rates across multiple models and benchmarks, highlighting the effectiveness of our approach. The code is available at https://github.com/xxiqiao/TROJail. Warning: this paper contains examples of harmful content.


Key Contributions

  • Formulates multi-turn LLM jailbreaking as a trajectory-level RL problem, optimizing the harmfulness of the final-turn response as the outcome reward.
  • Introduces two process rewards integrated into advantage estimation: one penalizing premature harmful prompts that trigger refusals, and one rewarding semantic steering toward targeted harmful content.
  • Demonstrates improved attack success rates over turn-level baselines across multiple LLMs and jailbreak benchmarks.
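The reward-shaping idea above can be sketched in a few lines. The following is a minimal illustration, not the paper's implementation: the function name, the weighting coefficients, the use of discounted return-to-go minus a constant baseline, and the per-turn signal definitions are all assumptions made for clarity.

```python
def trajectory_advantages(outcome_reward, refusals, relevances,
                          baseline=0.0, gamma=1.0,
                          w_refusal=0.5, w_relevance=0.5):
    """Hypothetical sketch of process-reward-shaped advantage estimation.

    outcome_reward: harmfulness score of the final-turn response (scalar).
    refusals:   per-turn flags, 1.0 if the target model refused, else 0.0.
    relevances: per-turn semantic relevance of the response to the targeted
                harmful content, in [0, 1].
    """
    # Per-turn shaped reward: penalize prompts that triggered a refusal,
    # reward responses that steer semantically toward the target content.
    rewards = [w_relevance * rel - w_refusal * ref
               for ref, rel in zip(refusals, relevances)]
    rewards[-1] += outcome_reward  # sparse outcome reward attaches to the last turn

    # Discounted return-to-go minus a baseline gives a per-turn advantage,
    # so intermediate prompts receive credit beyond the final-turn outcome alone.
    advantages, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        advantages.append(g - baseline)
    advantages.reverse()
    return advantages


# Example: a 3-turn trajectory where turn 2 triggered a refusal and the
# final response was judged harmful (outcome_reward = 1.0).
print(trajectory_advantages(1.0, [0.0, 1.0, 0.0], [1.0, 0.0, 1.0]))
# → [1.5, 1.0, 1.5]
```

The shaping addresses the sparsity problem noted in the abstract: with only the outcome reward, every turn in a failed trajectory receives identical (zero) feedback, whereas the process terms differentiate intermediate prompts that triggered refusals from those that made semantic progress.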

🛡️ Threat Analysis


Details

Domains
nlp, reinforcement-learning
Model Types
llm, rl
Threat Tags
black_box, inference_time, targeted
Applications
llm safety, chatbots, conversational ai systems