Safety Alignment of LMs via Non-cooperative Games
Anselm Paulus 1,2, Ilia Kulikov 1, Brandon Amos 1, Rémi Munos 1, Ivan Evtimov 1, Kamalika Chaudhuri 1, Arman Zharmagambetov 2,1
Published on arXiv
2512.20806
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
AdvGame simultaneously improves Defender LM safety (lower jailbreak success rate) and utility (higher task accuracy) compared to sequential adversarial training baselines on Qwen2.5-7B-Instruct.
Novel technique introduced
AdvGame
Ensuring the safety of language models (LMs) while maintaining their usefulness remains a critical challenge in AI alignment. Current approaches rely on sequential adversarial training: generating adversarial prompts and fine-tuning LMs to defend against them. We introduce a different paradigm: framing safety alignment as a non-zero-sum game between an Attacker LM and a Defender LM trained jointly via online reinforcement learning. Each LM continuously adapts to the other's evolving strategies, driving iterative improvement. Our method uses a preference-based reward signal derived from pairwise comparisons instead of point-wise scores, providing more robust supervision and potentially reducing reward hacking. Our RL recipe, AdvGame, shifts the Pareto frontier of safety and utility, yielding a Defender LM that is simultaneously more helpful and more resilient to adversarial attacks. In addition, the resulting Attacker LM converges into a strong, general-purpose red-teaming agent that can be directly deployed to probe arbitrary target models.
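To make "non-zero-sum" and "trained jointly" concrete, the following is a minimal toy sketch, not the AdvGame recipe itself. It assumes a hypothetical 2x2 matrix game in which the two players have separate payoff tables that do not sum to a constant, and both softmax policies are updated in the same step from each other's current strategy. The payoff values, the function names, and the exact-gradient update are all illustrative assumptions; the paper trains full LMs with online RL.

```python
import math

# Hypothetical payoffs (assumed for illustration only).
# Rows index attacker actions, columns index defender actions.
# The entrywise sums are not constant, so the game is non-zero-sum.
ATTACKER_PAYOFF = [[0.2, 0.1], [1.0, 0.3]]
DEFENDER_PAYOFF = [[1.0, 0.2], [0.0, 0.8]]


def softmax(logits):
    """Numerically stable softmax over a short list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def action_values(payoff, opponent_probs, is_row_player):
    """Expected payoff of each own action against the opponent's mixed strategy."""
    if is_row_player:
        return [sum(p * q for p, q in zip(row, opponent_probs)) for row in payoff]
    n_cols = len(payoff[0])
    return [
        sum(payoff[i][j] * opponent_probs[i] for i in range(len(payoff)))
        for j in range(n_cols)
    ]


def policy_gradient(probs, values):
    """Gradient of expected payoff with respect to softmax logits."""
    baseline = sum(p * v for p, v in zip(probs, values))
    return [p * (v - baseline) for p, v in zip(probs, values)]


def train_concurrently(steps=200, lr=0.5):
    """Update both players in the same step from the same snapshot of policies."""
    att_logits, def_logits = [0.0, 0.0], [0.0, 0.0]
    for _ in range(steps):
        att_p, def_p = softmax(att_logits), softmax(def_logits)
        att_grad = policy_gradient(att_p, action_values(ATTACKER_PAYOFF, def_p, True))
        def_grad = policy_gradient(def_p, action_values(DEFENDER_PAYOFF, att_p, False))
        att_logits = [l + lr * g for l, g in zip(att_logits, att_grad)]
        def_logits = [l + lr * g for l, g in zip(def_logits, def_grad)]
    return softmax(att_logits), softmax(def_logits)


if __name__ == "__main__":
    attacker_policy, defender_policy = train_concurrently()
    print("attacker:", attacker_policy)
    print("defender:", defender_policy)
```

In this toy setting the defender's best response depends on what the attacker currently does, so each player's update tracks the other's evolving strategy. A sequential scheme would instead fully optimize one player before touching the other.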
Key Contributions
- Frames LM safety alignment as a non-zero-sum, non-cooperative game between a separate Attacker LM and Defender LM trained concurrently via online RL, rather than through sequential adversarial training.
- Introduces preference-based (pairwise comparison) reward signals instead of point-wise scalar scores, providing more robust supervision and potentially reducing reward hacking.
- AdvGame shifts the Pareto frontier of safety and utility, producing a Defender LM that is simultaneously more helpful and more resilient to adversarial attacks, and an Attacker LM that functions as a strong general-purpose red-teaming agent that can be deployed directly against arbitrary target models.
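The difference between pairwise and point-wise rewards can be illustrated with a small sketch. It assumes a hypothetical judge exposed as a callable `prefers(a, b)` and turns its comparisons into a per-response reward equal to the fraction of other candidates that response beats. The function names, the toy judge, and the win-rate aggregation are assumptions for illustration; the source does not specify how AdvGame aggregates comparisons.

```python
from typing import Callable, Dict, List, Sequence


def pairwise_rewards(
    responses: Sequence[str],
    prefers: Callable[[str, str], bool],
) -> List[float]:
    """Reward each response by its win rate against the other candidates."""
    n = len(responses)
    if n < 2:
        raise ValueError("need at least two responses to compare")
    rewards = []
    for i, a in enumerate(responses):
        wins = sum(
            1 for j, b in enumerate(responses) if j != i and prefers(a, b)
        )
        rewards.append(wins / (n - 1))
    return rewards


def make_toy_judge(hidden_quality: Dict[str, float]) -> Callable[[str, str], bool]:
    """Hypothetical stand-in for a judge: prefers the higher hidden quality."""
    def prefers(a: str, b: str) -> bool:
        return hidden_quality[a] > hidden_quality[b]
    return prefers


responses = ["response A", "response B", "response C"]
quality = {"response A": 0.9, "response B": 0.4, "response C": 0.1}

rewards = pairwise_rewards(responses, make_toy_judge(quality))

# Rescaling the hidden scores changes any point-wise reward by the same factor,
# but leaves the pairwise rewards untouched because only the ordering matters.
inflated = {name: 100.0 * score for name, score in quality.items()}
rewards_inflated = pairwise_rewards(responses, make_toy_judge(inflated))

if __name__ == "__main__":
    print(rewards)
    print(rewards_inflated)
```

Because the reward is bounded in [0, 1] and depends only on which response wins each comparison, a policy cannot raise it simply by pushing a scalar score to extreme values. This is one plausible reading of why pairwise supervision might be harder to exploit.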