Safety Alignment of LMs via Non-cooperative Games

Ensuring the safety of language models (LMs) while maintaining their usefulness remains a critical challenge in AI alignment. Current approaches rely on sequential adversarial training: generating adversarial prompts and fine-tuning LMs to defend against them. We introduce a different paradigm: framing safety alignment as a non-zero-sum game between an Attacker LM and a Defender LM trained jointly via online reinforcement learning. Each LM continuously adapts to the other's evolving strategies, driving iterative improvement. Our method uses a preference-based reward signal derived from pairwise comparisons instead of point-wise scores, providing more robust supervision and potentially reducing reward hacking. Our RL recipe, AdvGame, shifts the Pareto frontier of safety and utility, yielding a Defender LM that is simultaneously more helpful and more resilient to adversarial attacks. In addition, the resulting Attacker LM converges into a strong, general-purpose red-teaming agent that can be directly deployed to probe arbitrary target models.
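To make the contrast between pairwise and point-wise supervision concrete, here is a minimal Python sketch. It is not AdvGame's reward function. It assumes a hypothetical judge `prefers(a, b)` that returns the probability that response `a` is preferred over response `b`, and it scores each sampled response by its average win rate against the other responses in the same group. The names `pairwise_rewards` and `toy_judge` are illustrative, and the length-based judge is only a placeholder for a real preference model.

```python
from typing import Callable, List, Sequence


def pairwise_rewards(
    responses: Sequence[str],
    prefers: Callable[[str, str], float],
) -> List[float]:
    """Score each response by its mean win rate against the others.

    `prefers(a, b)` is a hypothetical judge returning P(a is preferred to b).
    The reward depends only on comparisons within the group, not on an
    absolute score assigned to a single response.
    """
    n = len(responses)
    if n < 2:
        raise ValueError("need at least two responses to compare")
    rewards = []
    for i in range(n):
        wins = sum(
            prefers(responses[i], responses[j]) for j in range(n) if j != i
        )
        rewards.append(wins / (n - 1))
    return rewards


def toy_judge(a: str, b: str) -> float:
    """Placeholder judge (assumption): prefers the longer response."""
    if len(a) == len(b):
        return 0.5
    return 1.0 if len(a) > len(b) else 0.0


if __name__ == "__main__":
    print(pairwise_rewards(["a", "bbb", "cc"], toy_judge))
```

Because every reward is a win rate in [0, 1] computed relative to sibling samples, a response cannot raise its reward simply by inflating an absolute score, which is one plausible reading of why the abstract expects pairwise signals to be harder to exploit.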

Key Contributions

Frames LM safety alignment as a non-zero-sum, non-cooperative game between a separate Attacker LM and Defender LM that are trained concurrently via online RL, replacing sequential alternating optimization.
Introduces preference-based (pairwise comparison) reward signals instead of point-wise scalar scores, providing more robust supervision and potentially reducing reward hacking.
AdvGame shifts the Pareto frontier of safety and utility, producing a Defender LM that is simultaneously more helpful and more resilient to adversarial attacks, and an Attacker LM that functions as a strong general-purpose red-teaming agent.
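The difference between concurrent training in a non-zero-sum game and sequential alternating optimization can be illustrated with a deliberately crude toy, sketched below in Python. It is not the AdvGame recipe: each player has a single logit over two actions, the payoff tables are invented for illustration, and the Defender does not condition on the Attacker's action. The point is only that each player ascends its own payoff (the two tables are not negatives of each other) and that both gradients are computed from the same policy snapshot before either player moves.

```python
import math

# Hypothetical payoffs, indexed [attacker_action][defender_action].
# Attacker actions: 0 = mild prompt, 1 = adversarial prompt.
# Defender actions: 0 = comply, 1 = refuse.
ATTACKER_PAYOFF = [[0.0, 0.0], [1.0, 0.0]]    # rewarded only for a successful attack
DEFENDER_PAYOFF = [[1.0, -1.0], [-1.0, 1.0]]  # rewarded for helpfulness and for safety


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def train(steps: int = 2000, lr: float = 0.5):
    """Simultaneous gradient ascent; returns (P(adversarial), P(refuse))."""
    theta_att, theta_def = 0.0, 0.0  # one logit per player
    a, d = ATTACKER_PAYOFF, DEFENDER_PAYOFF
    for _ in range(steps):
        p_att, p_def = sigmoid(theta_att), sigmoid(theta_def)
        # Exact gradient of each player's expected payoff w.r.t. its own logit.
        adv_att = (1 - p_def) * (a[1][0] - a[0][0]) + p_def * (a[1][1] - a[0][1])
        adv_def = (1 - p_att) * (d[0][1] - d[0][0]) + p_att * (d[1][1] - d[1][0])
        g_att = adv_att * p_att * (1 - p_att)
        g_def = adv_def * p_def * (1 - p_def)
        # Concurrent update: both gradients come from the same snapshot.
        theta_att += lr * g_att
        theta_def += lr * g_def
    return sigmoid(theta_att), sigmoid(theta_def)


if __name__ == "__main__":
    p_adversarial, p_refuse = train()
    print(f"P(adversarial)={p_adversarial:.3f}  P(refuse)={p_refuse:.3f}")
```

In this toy the Attacker drifts toward adversarial prompts while the Defender simultaneously learns to refuse them, so neither policy is ever trained against a frozen opponent. A sequential scheme would instead hold one logit fixed for many steps before updating the other.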

🛡️ Threat Analysis

Details

Domains

nlp

Model Types

llm, transformer, rl

Threat Tags

training_time, inference_time, black_box

Datasets

Qwen2.5-7B-Instruct (model), standard safety benchmarks

Applications

llm safety alignment, red-teaming, conversational ai

2025, 3 citations
