attack 2025

Agentic Reinforcement Learning for Search is Unsafe

Yushi Yang ¹, Shreyansh Padarha ¹, Andrew Lee ², Adam Mahdi ¹

¹ University of Oxford

² Harvard University

1 citations · 35 references · arXiv

Published on arXiv

2510.17431

Prompt Injection

OWASP LLM Top 10 — LLM01

Excessive Agency

OWASP LLM Top 10 — LLM08

Key Finding

Two simple prompt-level attacks reduce refusal rates by up to 60.0% and answer safety by 82.5% across Qwen-2.5-7B and Llama-3.2-3B RL-trained search models with both local and web search.

Search attack / Multi-search attack

Novel technique introduced

Agentic reinforcement learning (RL) trains large language models to autonomously call tools during reasoning, with search as the most common application. These models excel at multi-step reasoning tasks, but their safety properties are not well understood. In this study, we show that RL-trained search models inherit refusal from instruction tuning and often deflect harmful requests by turning them into safe queries. However, this safety is fragile. Two simple attacks, one that forces the model to begin response with search (Search attack), another that encourages models to repeatedly search (Multi-search attack), trigger cascades of harmful searches and answers. Across two model families (Qwen, Llama) with both local and web search, these attacks lower refusal rates by up to 60.0%, answer safety by 82.5%, and search-query safety by 82.4%. The attacks succeed by triggering models to generate harmful, request-mirroring search queries before they can generate the inherited refusal tokens. This exposes a core weakness of current RL training: it rewards continued generation of effective queries without accounting for their harmfulness. As a result, RL search models have vulnerabilities that users can easily exploit, making it urgent to develop safety-aware agentic RL pipelines optimising for safe search.

Key Contributions

Demonstrates that RL-trained search models inherit refusal from instruction tuning but this safety is fragile and easily bypassed
Proposes two simple attacks (Search attack and Multi-search attack) that force harmful search queries before refusal tokens can be generated, degrading refusal rates by up to 60.0%, answer safety by 82.5%, and search-query safety by 82.4%
Identifies a core weakness in current agentic RL training: it rewards continued effective query generation without penalizing harmful queries, creating an exploitable timing vulnerability

🛡️ Threat Analysis

Details

Domains

nlpreinforcement-learning

Model Types

llmtransformerrl

Threat Tags

black_boxinference_timetargeted

Datasets

AdvBench

Applications

llm search agentsagentic ai systemsretrieval-augmented generation

Read PDF arXiv DOI

Agentic Reinforcement Learning for Search is Unsafe

Key Contributions

🛡️ Threat Analysis

Details

Similar Papers

Optimizing AI Agent Attacks With Synthetic Data

SafeSearch: Do Not Trade Safety for Utility in LLM Search Agents

AdvEvo-MARL: Shaping Internalized Safety through Adversarial Co-Evolution in Multi-Agent Reinforcement Learning

Shadows in the Code: Exploring the Risks and Defenses of LLM-based Multi-Agent Software Development Systems

Deep Research Brings Deeper Harm

When Agents See Humans as the Outgroup: Belief-Dependent Bias in LLM-Powered Agents

From Storage to Steering: Memory Control Flow Attacks on LLM Agents

Zombie Agents: Persistent Control of Self-Evolving LLM Agents via Self-Reinforcing Injections