Vaishakh Keshava

h-index: 3 · 1,995 citations · 9 papers (total)

Papers in Database (2)

defense arXiv Oct 6, 2025

Adversarial Reinforcement Learning for Large Language Model Agent Safety

Zizhao Wang, Dingcheng Li, Vaishakh Keshava et al. · Google · The University of Texas at Austin +2 more

Defends LLM tool-using agents from indirect prompt injection via adversarial RL co-training in a two-player zero-sum game

Prompt Injection nlp reinforcement-learning
3 citations

defense arXiv Feb 9, 2026

Reinforcement Learning with Backtracking Feedback

Bilgehan Sel, Vaishakh Keshava, Phillip Wallis et al. · Google · Virginia Tech +1 more

Trains LLMs to self-correct safety violations mid-generation via RL and a 'backtrack by x tokens' signal, reducing GCG and jailbreak attack success rates

Input Manipulation Attack Prompt Injection nlp