Ivan Evtimov

h-index: 11 13,809 citations 48 papers (total)

Papers in Database (3)

attack arXiv Oct 6, 2025 · Oct 2025

RL Is a Hammer and LLMs Are Nails: A Simple Reinforcement Learning Recipe for Strong Prompt Injection

Yuxin Wen, Arman Zharmagambetov, Ivan Evtimov et al. · University of Maryland · Meta

Trains RL attacker from scratch to perform prompt injection, achieving 98% ASR against GPT-4o and bypassing Instruction Hierarchy and SecAlign defenses

Prompt Injection nlp
9 citations PDF Code
defense arXiv Oct 1, 2025 · Oct 2025

Large Reasoning Models Learn Better Alignment from Flawed Thinking

ShengYun Peng, Eric Smith, Ivan Evtimov et al. · Meta · Georgia Institute of Technology +1 more

Defends LLMs against chain-of-thought jailbreaks by RL-training models to self-correct injected flawed reasoning premises

Prompt Injection nlp
7 citations PDF
defense arXiv Dec 23, 2025 · Dec 2025

Safety Alignment of LMs via Non-cooperative Games

Anselm Paulus, Ilia Kulikov, Brandon Amos et al. · Meta · University of Tübingen

Defends LLMs against jailbreaks by jointly training an Attacker and Defender LM as a non-cooperative RL game, shifting the safety-utility Pareto frontier

Prompt Injection nlp
1 citations PDF