Latest papers

2 papers
defense arXiv Jan 26, 2026

TriPlay-RL: Tri-Role Self-Play Reinforcement Learning for LLM Safety Alignment

Zhewen Tan, Wenhan Yu, Jianfeng Si et al. · Peking University · Qiyuan Tech +1 more

Closed-loop RL framework co-training LLM attacker, defender, and evaluator to iteratively improve safety alignment with minimal annotation

Prompt Injection nlp reinforcement-learning
defense arXiv Aug 12, 2025

Efficient Switchable Safety Control in LLMs via Magic-Token-Guided Co-Training

Jianfeng Si, Lin Sun, Zhewen Tan et al. · Qiyuan Tech

Co-training framework embeds switchable safety modes in one LLM via magic tokens, achieving robust jailbreak resistance at lower cost

Prompt Injection nlp