Defense · 2025

Safety Alignment of LMs via Non-cooperative Games

Anselm Paulus 1,2, Ilia Kulikov 1, Brandon Amos 1, Rémi Munos 1, Ivan Evtimov 1, Kamalika Chaudhuri 1, Arman Zharmagambetov 2,1

1 citation · 52 references · arXiv

Published on arXiv: 2512.20806

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

AdvGame simultaneously improves Defender LM safety (lower jailbreak success rate) and utility (higher task accuracy) compared to sequential adversarial training baselines on Qwen2.5-7B-Instruct.

AdvGame

Novel technique introduced


Ensuring the safety of language models (LMs) while maintaining their usefulness remains a critical challenge in AI alignment. Current approaches rely on sequential adversarial training: generating adversarial prompts and fine-tuning LMs to defend against them. We introduce a different paradigm: framing safety alignment as a non-zero-sum game between an Attacker LM and a Defender LM trained jointly via online reinforcement learning. Each LM continuously adapts to the other's evolving strategies, driving iterative improvement. Our method uses a preference-based reward signal derived from pairwise comparisons instead of point-wise scores, providing more robust supervision and potentially reducing reward hacking. Our RL recipe, AdvGame, shifts the Pareto frontier of safety and utility, yielding a Defender LM that is simultaneously more helpful and more resilient to adversarial attacks. In addition, the resulting Attacker LM converges into a strong, general-purpose red-teaming agent that can be directly deployed to probe arbitrary target models.
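The abstract's preference-based reward (pairwise comparisons rather than point-wise scores) can be sketched as a toy round-robin tournament. Note that `judge_prefers` is a hypothetical stand-in for the paper's actual judge; the length heuristic below is purely illustrative:

```python
# Toy sketch of a preference-based reward signal. In the paper this role is
# played by a learned/LLM judge; here `judge_prefers` is a trivial placeholder.

def judge_prefers(response_a: str, response_b: str) -> bool:
    """Placeholder preference judge: prefers the shorter reply.
    A real judge would compare safety/helpfulness of the two responses."""
    return len(response_a) <= len(response_b)

def pairwise_rewards(responses: list[str]) -> list[float]:
    """Score each response by its win rate in round-robin pairwise
    comparisons, yielding a relative reward instead of a point-wise score."""
    n = len(responses)
    wins = [0] * n
    for i in range(n):
        for j in range(n):
            if i != j and judge_prefers(responses[i], responses[j]):
                wins[i] += 1
    return [w / (n - 1) for w in wins]
```

Because rewards are win rates over comparisons, a policy cannot inflate its score by gaming an absolute scale, which is the intuition behind the claimed robustness to reward hacking.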


Key Contributions

  • Frames LLM safety alignment as a non-zero-sum, non-cooperative game between a separate Attacker LM and Defender LM trained concurrently via online RL — replacing sequential alternating optimization.
  • Introduces preference-based (pairwise comparison) reward signals instead of point-wise scalar scores, providing more robust supervision and reducing reward hacking.
  • AdvGame shifts the Pareto frontier of safety and utility, producing a Defender LM that is simultaneously more helpful and more resilient, and an Attacker LM that functions as a strong general-purpose red-teaming agent.
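The joint attacker/defender dynamic can be illustrated with a deliberately minimal two-player loop. Everything here (`ToyPolicy`, `advgame_step`, the scalar payoffs) is invented for exposition and reduces the game to opposing scalars; the actual method trains full LMs with preference-based rewards:

```python
import random

class ToyPolicy:
    """Stand-in for a policy LM: a scalar 'skill' nudged by rewards."""
    def __init__(self, seed: int):
        self.rng = random.Random(seed)
        self.skill = 0.0

    def act(self) -> float:
        # Noisy action around the current skill level.
        return self.skill + self.rng.gauss(0.0, 0.1)

    def update(self, reward: float, lr: float = 0.1):
        # Crude policy-gradient-style nudge in the reward direction.
        self.skill += lr * reward

def advgame_step(attacker: ToyPolicy, defender: ToyPolicy) -> float:
    """One joint step: the attacker is rewarded for a successful attack
    (positive margin), the defender for resisting it. Both adapt online
    to the other's evolving strategy, as in the concurrent-training setup."""
    attack, defense = attacker.act(), defender.act()
    margin = attack - defense      # > 0 means the attack succeeded
    attacker.update(margin)
    defender.update(-margin)
    return margin
```

The key design point mirrored here is that both players update every step, so neither trains against a frozen opponent, in contrast to sequential alternating optimization.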

🛡️ Threat Analysis


Details

Domains
nlp
Model Types
llm · transformer · rl
Threat Tags
training_time · inference_time · black_box
Datasets
Qwen2.5-7B-Instruct (model) · standard safety benchmarks
Applications
llm safety alignment · red-teaming · conversational ai