
Agentic Reinforcement Learning for Search is Unsafe

Yushi Yang 1, Shreyansh Padarha 1, Andrew Lee 2, Adam Mahdi 1

1 citation · 35 references · arXiv


Published on arXiv · 2510.17431

Prompt Injection

OWASP LLM Top 10 — LLM01

Excessive Agency

OWASP LLM Top 10 — LLM08

Key Finding

Two simple prompt-level attacks reduce refusal rates by up to 60.0% and answer safety by 82.5% across Qwen-2.5-7B and Llama-3.2-3B RL-trained search models with both local and web search.

Search attack / Multi-search attack

Novel technique introduced


Agentic reinforcement learning (RL) trains large language models to autonomously call tools during reasoning, with search as the most common application. These models excel at multi-step reasoning tasks, but their safety properties are not well understood. In this study, we show that RL-trained search models inherit refusal from instruction tuning and often deflect harmful requests by turning them into safe queries. However, this safety is fragile. Two simple attacks, one that forces the model to begin its response with a search call (Search attack) and another that encourages the model to search repeatedly (Multi-search attack), trigger cascades of harmful searches and answers. Across two model families (Qwen, Llama) with both local and web search, these attacks lower refusal rates by up to 60.0%, answer safety by 82.5%, and search-query safety by 82.4%. The attacks succeed by triggering the model to generate harmful, request-mirroring search queries before it can generate the inherited refusal tokens. This exposes a core weakness of current RL training: it rewards continued generation of effective queries without accounting for their harmfulness. As a result, RL search models have vulnerabilities that users can easily exploit, making it urgent to develop safety-aware agentic RL pipelines that optimise for safe search.


Key Contributions

  • Demonstrates that RL-trained search models inherit refusal behaviour from instruction tuning, but that this safety is fragile and easily bypassed
  • Proposes two simple attacks (Search attack and Multi-search attack) that force harmful search queries before refusal tokens can be generated, degrading refusal rates by up to 60.0%, answer safety by 82.5%, and search-query safety by 82.4%
  • Identifies a core weakness in current agentic RL training: it rewards continued effective query generation without penalizing harmful queries, creating an exploitable timing vulnerability

🛡️ Threat Analysis


Details

Domains
nlp · reinforcement-learning
Model Types
llm · transformer · rl
Threat Tags
black_box · inference_time · targeted
Datasets
AdvBench
Applications
llm search agents · agentic ai systems · retrieval-augmented generation