defense 2026

TriPlay-RL: Tri-Role Self-Play Reinforcement Learning for LLM Safety Alignment

Zhewen Tan ^1,2,3, Wenhan Yu ^1,2, Jianfeng Si ², Tongxin Liu ¹, Kaiqi Guan ¹, Huiyan Jin ¹, Jiawen Tao ¹, Xiaokun Yuan ¹, Duohe Ma ³, Xiangzheng Zhang ², Tong Yang ¹, Lin Sun ²

¹ Peking University

² Qiyuan Tech

³ University of Chinese Academy of Sciences

0 citations · 39 references · arXiv

Published on arXiv

2601.18292

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

The red-team attacker achieves 90% ASR against Llama-3.1-Nemotron-Nano-8B-v1 and a 3× improvement over baseline against Qwen3-8B, while the defender gains 10–30% in safety performance without degrading reasoning capability.

TriPlay-RL

Novel technique introduced

In recent years, safety risks associated with large language models have become increasingly prominent, highlighting the urgent need to mitigate the generation of toxic and harmful content. The mainstream paradigm for LLM safety alignment typically adopts a collaborative framework involving three roles: an attacker for adversarial prompt generation, a defender for safety defense, and an evaluator for response assessment. In this paper, we propose a closed-loop reinforcement learning framework called TriPlay-RL that enables iterative and co-improving collaboration among three roles with near-zero manual annotation. Experimental results show that the attacker preserves high output diversity while achieving a 20%-50% improvement in adversarial effectiveness; the defender attains 10%-30% gains in safety performance without degrading general reasoning capability; and the evaluator continuously refines its fine-grained judgment ability through iterations, accurately distinguishing unsafe responses, simple refusals, and useful guidance. Overall, our framework establishes an efficient and scalable paradigm for LLM safety alignment, enabling continuous co-evolution within a unified learning loop.

Key Contributions

TriPlay-RL: a three-role closed-loop RL framework (attacker, defender, evaluator) that enables iterative co-evolution with near-zero manual annotation, mitigating pattern collapse during red-team training
Diversity penalties and multi-model adversarial training for the attacker to sustain adversarial effectiveness (20–50% gain in ASR) while preventing prompt convergence
Three-level reward mechanism for the defender achieving 10–30% safety gains without sacrificing general reasoning, and a multi-expert annotation system to train a robust evaluator

🛡️ Threat Analysis

Details

Domains

nlpreinforcement-learning

Model Types

llmtransformerrl

Threat Tags

black_boxinference_timetraining_time

Datasets

Llama-3.1-Nemotron-Nano-8B-v1 (target model)Qwen3-8B (target model)

Applications

llm safety alignmentred-teamingharmful content mitigation

Read PDF arXiv DOI Code

TriPlay-RL: Tri-Role Self-Play Reinforcement Learning for LLM Safety Alignment

Key Contributions

🛡️ Threat Analysis

Details

Similar Papers

Adversarial Reinforcement Learning for Large Language Model Agent Safety

Certifiable Safe RLHF: Fixed-Penalty Constraint Optimization for Safer Language Models

MAGIC: A Co-Evolving Attacker-Defender Adversarial Game for Robust LLM Safety

Output Supervision Can Obfuscate the Chain of Thought

Safety Alignment of LMs via Non-cooperative Games

Online Learning Defense against Iterative Jailbreak Attacks via Prompt Optimization

Forewarned is Forearmed: Pre-Synthesizing Jailbreak-like Instructions to Enhance LLM Safety Guardrail to Potential Attacks

From Shallow to Deep: Pinning Semantic Intent via Causal GRPO