Andrew Lee

h-index: 1 3 citations 2 papers (total)

Papers in Database (1)

attack arXiv Oct 20, 2025 · Oct 2025

Agentic Reinforcement Learning for Search is Unsafe

Yushi Yang, Shreyansh Padarha, Andrew Lee et al. · University of Oxford · Harvard University

Discovers two simple prompt-level attacks that bypass safety in RL-trained LLM search agents by triggering search before refusal tokens

Prompt Injection Excessive Agency nlpreinforcement-learning
1 citations PDF