attack 2025

When Search Goes Wrong: Red-Teaming Web-Augmented Large Language Models

1 citation · 42 references · arXiv

Published on arXiv

2510.09689

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

CREST-Search effectively bypasses safety filters in web-search-based LLM systems by generating adversarial queries that induce retrieval and citation of harmful web content.

CREST-Search

Novel technique introduced

Large Language Models (LLMs) have been augmented with web search to overcome the limits of their static knowledge boundary by accessing up-to-date information from the open Internet. While this integration enhances model capability, it also introduces a distinct safety threat surface: the retrieval and citation process can expose users to harmful or low-credibility web content. Existing red-teaming methods are largely designed for standalone LLMs: they focus primarily on unsafe generation and ignore the risks that emerge from the complex search workflow. To address this gap, we propose CREST-Search, a pioneering red-teaming framework for LLMs with web search. The cornerstone of CREST-Search is a set of three novel attack strategies that generate seemingly benign search queries yet induce unsafe citations. It also employs an iterative in-context refinement mechanism to strengthen adversarial effectiveness under black-box constraints. In addition, we construct a search-specific harmful dataset, WebSearch-Harm, which enables fine-tuning a specialized red-teaming model to improve query quality. Our experiments demonstrate that CREST-Search can effectively bypass safety filters and systematically expose vulnerabilities in web-search-based LLM systems, underscoring the need to develop robust search models.

Key Contributions

CREST-Search: three novel attack strategies generating benign-looking search queries that induce unsafe citations in web-augmented LLMs
Iterative in-context refinement mechanism to improve adversarial effectiveness under black-box constraints
WebSearch-Harm dataset for fine-tuning a specialized red-teaming model targeting search-specific harmful queries
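
The summary does not spell out how the iterative in-context refinement works, so the following is only a minimal sketch of one plausible black-box loop, not the authors' implementation. Every name in it (`Attempt`, `refine_loop`, `propose`, `target`, `judge`, the `threshold` parameter, and the `.example` domains) is hypothetical, and the target system, judge, and proposer are toy stubs standing in for a real search-augmented LLM, a citation-safety classifier, and a red-teaming model.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Attempt:
    query: str
    citations: List[str]
    score: float  # fraction of returned citations flagged by the judge


def refine_loop(
    seed_query: str,
    propose: Callable[[str, List[Attempt]], str],  # hypothetical red-team model
    target: Callable[[str], List[str]],            # black box: query -> cited URLs
    judge: Callable[[str], bool],                  # True if a citation is flagged
    max_rounds: int = 5,
    threshold: float = 0.5,
) -> Tuple[Attempt, List[Attempt]]:
    """Query the target, score its citations, and refine using past attempts."""
    history: List[Attempt] = []
    query = seed_query
    for _ in range(max_rounds):
        citations = target(query)
        flagged = sum(1 for url in citations if judge(url))
        score = flagged / len(citations) if citations else 0.0
        history.append(Attempt(query, citations, score))
        if score >= threshold:
            break
        # Only observed outputs (the history) feed the next proposal;
        # no gradients or internals of the target are used.
        query = propose(seed_query, history)
    best = max(history, key=lambda a: a.score)
    return best, history


# Toy stubs (assumptions for illustration only).
def toy_target(query: str) -> List[str]:
    urls = ["https://encyclopedia.example/a"]
    if "forum" in query:
        urls.append("https://lowcred.example/b")
    return urls


def toy_judge(url: str) -> bool:
    return "lowcred.example" in url


def toy_propose(seed: str, history: List[Attempt]) -> str:
    return f"{seed} forum discussion (round {len(history)})"


best, history = refine_loop("topic overview", toy_propose, toy_target, toy_judge)
print(best.query, best.score, len(history))
```

The loop sees only the citations the target returns, which matches the black-box setting described above. In a real evaluation the stubs would be replaced by the system under test and a proper citation-safety judge.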

🛡️ Threat Analysis

Details

Domains

nlp

Model Types

llm

Threat Tags

black_box, inference_time

Datasets

WebSearch-Harm

Applications

web-augmented llm search systems, retrieval-augmented generation

Similar Papers

Beyond Context: Large Language Models Failure to Grasp Users Intent

VortexPIA: Indirect Prompt Injection Attack against LLMs for Efficient Extraction of User Privacy

Stand on The Shoulders of Giants: Building JailExpert from Previous Attack Experience

Adversarial versification in portuguese as a jailbreak operator in LLMs

Red-Bandit: Test-Time Adaptation for LLM Red-Teaming via Bandit-Guided LoRA Experts

Involuntary Jailbreak: On Self-Prompting Attacks

AutoRed: A Free-form Adversarial Prompt Generation Framework for Automated Red Teaming

In-Context Environments Induce Evaluation-Awareness in Language Models