ReliabilityRAG: Effective and Provably Robust Defense for RAG-based Web-Search
Zeyu Shen 1, Basileal Imana 1, Tong Wu 1, Chong Xiang 2, Prateek Mittal 1, Aleksandra Korolova 1
Published on arXiv (arXiv:2509.23519)
Input Manipulation Attack
OWASP ML Top 10 — ML01
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
ReliabilityRAG provides provably robust and empirically superior defense against RAG corpus poisoning compared to RobustRAG, while maintaining high benign accuracy and excelling at long-form generation tasks where prior methods failed.
ReliabilityRAG
Novel technique introduced
Retrieval-Augmented Generation (RAG) enhances Large Language Models by grounding their outputs in external documents. These systems, however, remain vulnerable to attacks on the retrieval corpus, such as prompt injection. RAG-based search systems (e.g., Google's Search AI Overview) present an interesting setting for studying and protecting against such threats, as defense algorithms can benefit from built-in reliability signals -- like document ranking -- and represent a non-LLM challenge for the adversary due to decades of work to thwart SEO. Motivated by, but not limited to, this scenario, this work introduces ReliabilityRAG, a framework for adversarial robustness that explicitly leverages reliability information of retrieved documents. Our first contribution adopts a graph-theoretic perspective to identify a "consistent majority" among retrieved documents to filter out malicious ones. We introduce a novel algorithm based on finding a Maximum Independent Set (MIS) on a document graph where edges encode contradiction. Our MIS variant explicitly prioritizes higher-reliability documents and provides provable robustness guarantees against bounded adversarial corruption under natural assumptions. Recognizing the computational cost of exact MIS for large retrieval sets, our second contribution is a scalable weighted sample and aggregate framework. It explicitly utilizes reliability information, preserving some robustness guarantees while efficiently handling many documents. We present empirical results showing ReliabilityRAG provides superior robustness against adversarial attacks compared to prior methods, maintains high benign accuracy, and excels in long-form generation tasks where prior robustness-focused methods struggled. Our work is a significant step towards more effective, provably robust defenses against retrieved corpus corruption in RAG.
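The core filtering idea can be illustrated with a small sketch: treat retrieved documents as graph nodes, connect any two that contradict each other, and keep a maximum-weight independent set, where weights encode reliability (e.g., derived from search rank). This is a hedged, brute-force illustration under assumed toy inputs, not the paper's actual algorithm; the `contradicts` oracle, the `claim` field, and the rank-based weights are all hypothetical stand-ins (in practice an NLI model or LLM judge would decide contradiction).

```python
from itertools import combinations

def max_weight_independent_set(docs, weights, contradicts):
    """Brute-force maximum-weight independent set on the contradiction
    graph: nodes are documents, an edge joins any contradicting pair.
    Reliability weights make the surviving 'consistent majority' favor
    higher-ranked documents. Exponential in len(docs); a sketch only,
    suitable for small retrieval sets."""
    n = len(docs)
    best_set, best_w = [], float("-inf")
    for r in range(n + 1):
        for idxs in combinations(range(n), r):
            # Skip any subset containing a contradicting pair.
            if any(contradicts(docs[i], docs[j])
                   for i, j in combinations(idxs, 2)):
                continue
            w = sum(weights[i] for i in idxs)
            if w > best_w:
                best_w, best_set = w, list(idxs)
    return best_set

# Toy example: docs carry a 'claim'; two docs contradict iff claims differ.
docs = [{"claim": "A"}, {"claim": "A"}, {"claim": "B"}, {"claim": "A"}]
weights = [1.0, 0.8, 0.6, 0.4]   # hypothetical rank-derived reliability
keep = max_weight_independent_set(
    docs, weights, lambda x, y: x["claim"] != y["claim"])
# keep → [0, 1, 3]: the consistent "A" majority; the lone "B" doc is dropped.
```

Note how weighting matters: even if the injected "B" documents outnumbered the "A" documents, higher reliability weights on top-ranked results could still let the trusted minority win, which is the intuition behind prioritizing reliability.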
Key Contributions
- Graph-theoretic Maximum Independent Set (MIS) algorithm that identifies a 'consistent majority' among retrieved documents to filter adversarially injected ones, with explicit prioritization of higher-reliability documents
- Scalable weighted sample-and-aggregate framework that preserves robustness guarantees while efficiently handling large document sets
- Provable robustness guarantees against bounded adversarial corpus corruption under natural assumptions, with empirical superiority over prior methods including on long-form generation tasks
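The second contribution, the scalable weighted sample-and-aggregate framework, can be sketched as follows. This is an illustrative approximation under assumed toy inputs, not the paper's implementation: `answer_fn`, the `claim` field, and the rank-derived weights are hypothetical, and a real system would query an LLM over each sampled subset.

```python
import random
from collections import Counter

def weighted_sample_and_aggregate(docs, weights, answer_fn,
                                  k=3, num_samples=20, seed=0):
    """Sketch of weighted sample-and-aggregate: draw many small document
    subsets with probability proportional to reliability, answer from each
    subset independently, then majority-vote over the answers. A few
    poisoned low-reliability documents can only sway the samples they
    happen to land in, which bounds their influence on the final vote."""
    rng = random.Random(seed)
    votes = Counter()
    idxs = list(range(len(docs)))
    for _ in range(num_samples):
        # Reliability-weighted sampling (with replacement, for simplicity).
        subset = rng.choices(idxs, weights=weights, k=k)
        votes[answer_fn([docs[i] for i in subset])] += 1
    return votes.most_common(1)[0][0]

# Hypothetical stand-in for an LLM answering from a document subset:
# just take the majority claim within the subset.
def answer_fn(subset):
    return Counter(d["claim"] for d in subset).most_common(1)[0][0]

docs = [{"claim": "A"}] * 4 + [{"claim": "B"}]   # one injected document
weights = [1.0, 0.8, 0.6, 0.5, 0.1]              # injected doc ranks lowest
result = weighted_sample_and_aggregate(docs, weights, answer_fn)
# → "A": the poisoned document rarely dominates a sampled subset.
```

The design choice to sample rather than run exact MIS is what makes the defense tractable for large retrieval sets, trading the exact consistency check for a probabilistic one that still concentrates weight on reliable documents.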
🛡️ Threat Analysis
The threat model is adversarial document injection into the RAG retrieval corpus: strategically crafted documents are injected to manipulate the outputs of LLM-integrated systems (adversarial SEO, corpus poisoning). Per the tagging guidelines, adversarial content manipulation of LLM-integrated systems, such as RAG document injection, maps to ML01 (Input Manipulation Attack).