Ivan Evtimov

attack arXiv Oct 6, 2025

RL Is a Hammer and LLMs Are Nails: A Simple Reinforcement Learning Recipe for Strong Prompt Injection

Yuxin Wen, Arman Zharmagambetov, Ivan Evtimov et al. · University of Maryland · Meta

Trains an RL attacker from scratch to perform prompt injection, achieving a 98% attack success rate (ASR) against GPT-4o and bypassing the Instruction Hierarchy and SecAlign defenses

Prompt Injection nlp

9 citations PDF Code

defense arXiv Oct 1, 2025

Large Reasoning Models Learn Better Alignment from Flawed Thinking

ShengYun Peng, Eric Smith, Ivan Evtimov et al. · Meta · Georgia Institute of Technology +1 more

Defends LLMs against chain-of-thought jailbreaks by RL-training models to self-correct injected flawed reasoning premises

Prompt Injection nlp

7 citations PDF

defense arXiv Dec 23, 2025

Safety Alignment of LMs via Non-cooperative Games

Anselm Paulus, Ilia Kulikov, Brandon Amos et al. · Meta · University of Tübingen

Defends LLMs against jailbreaks by jointly training an Attacker and Defender LM as a non-cooperative RL game, shifting the safety-utility Pareto frontier

Prompt Injection nlp

1 citation PDF
