TriPlay-RL: Tri-Role Self-Play Reinforcement Learning for LLM Safety Alignment
Zhewen Tan 1,2,3, Wenhan Yu 1,2, Jianfeng Si 2, Tongxin Liu 1, Kaiqi Guan 1, Huiyan Jin 1, Jiawen Tao 1, Xiaokun Yuan 1, Duohe Ma 3, Xiangzheng Zhang 2, Tong Yang 1, Lin Sun 2
Published on arXiv
2601.18292
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
The red-team attacker achieves 90% ASR against Llama-3.1-Nemotron-Nano-8B-v1 and a 3× improvement over baseline against Qwen3-8B, while the defender gains 10–30% in safety performance without degrading reasoning capability.
TriPlay-RL
Novel technique introduced
In recent years, safety risks associated with large language models have become increasingly prominent, highlighting the urgent need to mitigate the generation of toxic and harmful content. The mainstream paradigm for LLM safety alignment typically adopts a collaborative framework involving three roles: an attacker for adversarial prompt generation, a defender for safety defense, and an evaluator for response assessment. In this paper, we propose a closed-loop reinforcement learning framework called TriPlay-RL that enables iterative and co-improving collaboration among three roles with near-zero manual annotation. Experimental results show that the attacker preserves high output diversity while achieving a 20%-50% improvement in adversarial effectiveness; the defender attains 10%-30% gains in safety performance without degrading general reasoning capability; and the evaluator continuously refines its fine-grained judgment ability through iterations, accurately distinguishing unsafe responses, simple refusals, and useful guidance. Overall, our framework establishes an efficient and scalable paradigm for LLM safety alignment, enabling continuous co-evolution within a unified learning loop.
Key Contributions
- TriPlay-RL: a three-role closed-loop RL framework (attacker, defender, evaluator) that enables iterative co-evolution with near-zero manual annotation, mitigating pattern collapse during red-team training
- Diversity penalties and multi-model adversarial training for the attacker to sustain adversarial effectiveness (20–50% gain in ASR) while preventing prompt convergence
- Three-level reward mechanism for the defender achieving 10–30% safety gains without sacrificing general reasoning, and a multi-expert annotation system to train a robust evaluator