Agentic Reinforcement Learning for Search is Unsafe
Yushi Yang 1, Shreyansh Padarha 1, Andrew Lee 2, Adam Mahdi 1
Published on arXiv
2510.17431
Prompt Injection
OWASP LLM Top 10 — LLM01
Excessive Agency
OWASP LLM Top 10 — LLM08
Key Finding
Two simple prompt-level attacks reduce refusal rates by up to 60.0% and answer safety by 82.5% across Qwen-2.5-7B and Llama-3.2-3B RL-trained search models with both local and web search.
Search attack / Multi-search attack
Novel technique introduced
Agentic reinforcement learning (RL) trains large language models to autonomously call tools during reasoning, with search as the most common application. These models excel at multi-step reasoning tasks, but their safety properties are not well understood. In this study, we show that RL-trained search models inherit refusal from instruction tuning and often deflect harmful requests by turning them into safe queries. However, this safety is fragile. Two simple attacks, one that forces the model to begin response with search (Search attack), another that encourages models to repeatedly search (Multi-search attack), trigger cascades of harmful searches and answers. Across two model families (Qwen, Llama) with both local and web search, these attacks lower refusal rates by up to 60.0%, answer safety by 82.5%, and search-query safety by 82.4%. The attacks succeed by triggering models to generate harmful, request-mirroring search queries before they can generate the inherited refusal tokens. This exposes a core weakness of current RL training: it rewards continued generation of effective queries without accounting for their harmfulness. As a result, RL search models have vulnerabilities that users can easily exploit, making it urgent to develop safety-aware agentic RL pipelines optimising for safe search.
Key Contributions
- Demonstrates that RL-trained search models inherit refusal from instruction tuning but this safety is fragile and easily bypassed
- Proposes two simple attacks (Search attack and Multi-search attack) that force harmful search queries before refusal tokens can be generated, degrading refusal rates by up to 60.0%, answer safety by 82.5%, and search-query safety by 82.4%
- Identifies a core weakness in current agentic RL training: it rewards continued effective query generation without penalizing harmful queries, creating an exploitable timing vulnerability