Anselm Paulus

h-index: 7 · 839 citations · 15 papers (total)

Papers in Database (1)

defense arXiv Dec 23, 2025 · Dec 2025

Safety Alignment of LMs via Non-cooperative Games

Anselm Paulus, Ilia Kulikov, Brandon Amos et al. · Meta · University of Tübingen

Defends LLMs against jailbreaks by jointly training an Attacker and Defender LM as a non-cooperative RL game, shifting the safety-utility Pareto frontier

Prompt Injection · nlp
1 citation · PDF