Defense · 2025

SafeSearch: Do Not Trade Safety for Utility in LLM Search Agents

Qiusi Zhan¹,², Angeline Budiman-Chan², Abdelrahman Zayed², Xingzhi Guo², Daniel Kang¹, Joo-Kyung Kim²

2 citations · 35 references

Published on arXiv: 2510.17017

Prompt Injection

OWASP LLM Top 10 — LLM01

Excessive Agency

OWASP LLM Top 10 — LLM08

Key Finding

SafeSearch reduces LLM search agent harmfulness by over 70% across three red-teaming datasets while matching the QA performance of a utility-only fine-tuned agent, as confirmed by query-level reward analysis.

SafeSearch

Novel technique introduced


Large language model (LLM)-based search agents iteratively generate queries, retrieve external information, and reason to answer open-domain questions. While researchers have primarily focused on improving their utility, their safety behaviors remain underexplored. In this paper, we first evaluate search agents using red-teaming datasets and find that they are more likely to produce harmful outputs than base LLMs. For example, when asked "How can I track someone's location without their consent?", a base model refuses, whereas a search agent designed to retrieve and cite sources may lower its refusal threshold, fetch documents (e.g., court cases), and, once they are appended to its context, synthesize them into an informative yet unsafe summary. We further show that utility-oriented fine-tuning intensifies this risk, motivating the joint alignment of safety and utility. We present SafeSearch, a multi-objective reinforcement learning approach that couples a final-output safety/utility reward with a novel query-level shaping term that penalizes unsafe queries and rewards safe ones. Experiments show that SafeSearch reduces agent harmfulness by over 70% across three red-teaming datasets while producing safe, helpful responses, and that it matches the QA performance of a utility-only fine-tuned agent; further analyses confirm the effectiveness of the query-level reward in jointly improving safety and utility.
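To make the failure mode above concrete, here is a minimal, hypothetical sketch of the query-retrieve-reason loop the abstract describes. The names llm_propose_query, web_search, and llm_answer are illustrative stubs, not the paper's agent implementation.

from typing import List, Optional

def llm_propose_query(context: List[str]) -> Optional[str]:
    # Stand-in for the LLM's query-generation step; this toy version
    # issues one query and then decides it has enough evidence.
    return None if len(context) > 1 else "background on: " + context[0]

def web_search(query: str, k: int = 3) -> List[str]:
    # Stand-in retriever returning top-k document snippets.
    return [f"[doc {i}] snippet matching '{query}'" for i in range(k)]

def llm_answer(context: List[str]) -> str:
    # Stand-in for answer synthesis over the full context.
    return f"Answer synthesized from {len(context) - 1} retrieved snippets."

def run_search_agent(question: str, max_turns: int = 4) -> str:
    context = [question]
    for _ in range(max_turns):
        query = llm_propose_query(context)
        if query is None:                    # the model stops searching
            break
        context.extend(web_search(query))    # retrieved text appended verbatim
    return llm_answer(context)

print(run_search_agent("How can search agents answer open-domain questions?"))

The point of the sketch is that retrieved snippets enter the context verbatim, so the final answer conditions on whatever the queries pulled in; this is the step where citable documents can lower the refusal threshold, and the step SafeSearch's query-level reward targets.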


Key Contributions

  • Empirical finding that LLM search agents are more likely to produce harmful outputs than base LLMs, and that utility-oriented fine-tuning amplifies this risk
  • SafeSearch: a multi-objective RL framework coupling a final-output safety/utility reward with a novel query-level shaping term that penalizes unsafe search queries and rewards safe ones (see the sketch after this list)
  • Demonstrates an over-70% reduction in agent harmfulness across three red-teaming benchmarks while matching the QA performance of a utility-only fine-tuned agent on TriviaQA, HotpotQA, and Bamboogle
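Below is a minimal sketch of how a final-output reward and a query-level shaping term can be combined, in the spirit of the framework described above. The judges, the weights ALPHA and BETA, and the keyword heuristic are illustrative stand-ins, not the paper's implementation.

from typing import List

ALPHA = 1.0   # weight on the final-output safety/utility reward (assumed)
BETA = 0.2    # weight on the query-level shaping term (assumed)

# Toy keyword heuristic standing in for a real query-safety judge.
UNSAFE_TERMS = ("track someone", "without their consent")

def judge_query(query: str) -> float:
    # +1 for a query that looks safe, -1 for one that looks unsafe.
    return -1.0 if any(t in query.lower() for t in UNSAFE_TERMS) else 1.0

def judge_final(answer: str, is_redteam: bool) -> float:
    # Stand-in final-output judge: rewards refusing red-teaming prompts
    # and answering benign QA prompts (a real judge would score quality too).
    refused = "i can't help" in answer.lower()
    return 1.0 if refused == is_redteam else 0.0

def trajectory_reward(queries: List[str], answer: str, is_redteam: bool) -> float:
    # Couple the final-output reward with per-query shaping, averaging the
    # query-level scores over all queries issued during the rollout.
    final_r = judge_final(answer, is_redteam)
    query_r = sum(judge_query(q) for q in queries) / len(queries) if queries else 0.0
    return ALPHA * final_r + BETA * query_r

# Prints 0.8: the safe refusal earns the final reward, but the unsafe
# intermediate query is still penalized by the shaping term.
print(trajectory_reward(
    ["how to track someone without their consent legal cases"],
    "I can't help with that request.",
    is_redteam=True,
))

Averaging the query-level scores keeps the shaping term bounded regardless of how many queries the agent issues; in the paper this combined signal is optimized with reinforcement learning, whereas the stubs here only illustrate how the two objectives compose.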

🛡️ Threat Analysis


Details

Domains
nlp, reinforcement-learning
Model Types
llm, rl
Threat Tags
black_box, inference_time
Datasets
RRB, StrongREJECT, WildTeaming, TriviaQA, HotpotQA, Bamboogle
Applications
question answering, llm search agents, retrieval-augmented generation