
TriPlay-RL: Tri-Role Self-Play Reinforcement Learning for LLM Safety Alignment

Zhewen Tan 1,2,3, Wenhan Yu 1,2, Jianfeng Si 2, Tongxin Liu 1, Kaiqi Guan 1, Huiyan Jin 1, Jiawen Tao 1, Xiaokun Yuan 1, Duohe Ma 3, Xiangzheng Zhang 2, Tong Yang 1, Lin Sun 2

0 citations · 39 references · arXiv


Published on arXiv · 2601.18292

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

The red-team attacker achieves a 90% attack success rate (ASR) against Llama-3.1-Nemotron-Nano-8B-v1 and a 3× improvement over baseline against Qwen3-8B, while the defender gains 10–30% in safety performance without degrading reasoning capability.

TriPlay-RL

Novel technique introduced


In recent years, safety risks associated with large language models have become increasingly prominent, highlighting the urgent need to mitigate the generation of toxic and harmful content. The mainstream paradigm for LLM safety alignment typically adopts a collaborative framework involving three roles: an attacker for adversarial prompt generation, a defender for safety defense, and an evaluator for response assessment. In this paper, we propose a closed-loop reinforcement learning framework called TriPlay-RL that enables iterative and co-improving collaboration among three roles with near-zero manual annotation. Experimental results show that the attacker preserves high output diversity while achieving a 20–50% improvement in adversarial effectiveness; the defender attains 10–30% gains in safety performance without degrading general reasoning capability; and the evaluator continuously refines its fine-grained judgment ability through iterations, accurately distinguishing unsafe responses, simple refusals, and useful guidance. Overall, our framework establishes an efficient and scalable paradigm for LLM safety alignment, enabling continuous co-evolution within a unified learning loop.
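The closed loop described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function names, the reward values, and the three-way verdict labels ("unsafe" / "refusal" / "guidance", echoing the evaluator's distinction in the abstract) are all assumptions.

```python
def tri_play_iteration(attack, defend, judge, seed_prompts):
    """One hypothetical closed-loop round: attack -> defend -> evaluate -> reward.

    `attack`, `defend`, and `judge` stand in for the three LLM-backed roles.
    `judge` returns one of "unsafe", "refusal", or "guidance" (illustrative
    labels for the evaluator's three-way distinction).
    """
    # Illustrative defender shaping: punish unsafe output, mildly reward a
    # bare refusal, fully reward safe-and-useful guidance.
    defender_reward = {"unsafe": -1.0, "refusal": 0.3, "guidance": 1.0}
    transitions = []
    for seed in seed_prompts:
        adv_prompt = attack(seed)            # attacker rewrites the seed
        response = defend(adv_prompt)        # defender answers the attack
        verdict = judge(adv_prompt, response)
        transitions.append({
            "prompt": adv_prompt,
            "response": response,
            # Attacker gets credit only when the response is judged unsafe.
            "attacker_r": 1.0 if verdict == "unsafe" else 0.0,
            "defender_r": defender_reward[verdict],
        })
    return transitions
```

In an actual training loop these transitions would feed the policy updates of all three roles each iteration, which is what lets them co-evolve without manual labels.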


Key Contributions

  • TriPlay-RL: a three-role closed-loop RL framework (attacker, defender, evaluator) that enables iterative co-evolution with near-zero manual annotation, mitigating pattern collapse during red-team training
  • Diversity penalties and multi-model adversarial training for the attacker to sustain adversarial effectiveness (20–50% gain in ASR) while preventing prompt convergence
  • Three-level reward mechanism for the defender achieving 10–30% safety gains without sacrificing general reasoning, and a multi-expert annotation system to train a robust evaluator
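To make the diversity-penalty idea from the first two bullets concrete, here is a hedged sketch. The paper does not specify its diversity measure; this example assumes a token-set Jaccard similarity against previously generated prompts and a hypothetical penalty weight, purely for illustration.

```python
def diversity_penalized_reward(base_reward, prompt, history, weight=0.5):
    """Penalize an attacker reward by its overlap with past prompts.

    Assumed proxy: max Jaccard similarity over lowercase token sets
    (the paper's actual diversity penalty may differ).
    """
    tokens = set(prompt.lower().split())
    if not history or not tokens:
        return base_reward  # nothing to compare against

    def jaccard(a, b):
        return len(a & b) / len(a | b)

    # Most-similar previous prompt drives the penalty, discouraging
    # the attacker from converging on one jailbreak pattern.
    max_sim = max(jaccard(tokens, set(h.lower().split())) for h in history)
    return base_reward - weight * max_sim
```

A prompt that duplicates an earlier one loses half its reward at `weight=0.5`, while a genuinely novel prompt keeps the full base reward, which is the intended pressure against pattern collapse.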

🛡️ Threat Analysis


Details

Domains
nlp, reinforcement-learning
Model Types
llm, transformer, rl
Threat Tags
black_box, inference_time, training_time
Datasets
Llama-3.1-Nemotron-Nano-8B-v1 (target model), Qwen3-8B (target model)
Applications
llm safety alignment, red-teaming, harmful content mitigation