
Evaluating the Robustness of Retrieval-Augmented Generation to Adversarial Evidence in the Health Domain

Shakiba Amirshahi¹, Amin Bigdeli¹, Charles L. A. Clarke¹, Amira Ghenai²


Published on arXiv: 2509.03787

Input Manipulation Attack (OWASP ML Top 10 — ML01)

Prompt Injection (OWASP LLM Top 10 — LLM01)

Key Finding

Adversarial documents substantially degrade RAG answer alignment with ground truth, but co-presence of helpful documents in the retrieval pool largely preserves robustness.


Retrieval-augmented generation (RAG) systems provide a method for factually grounding the responses of a Large Language Model (LLM) by providing retrieved evidence, or context, as support. Guided by this context, RAG systems can reduce hallucinations and expand the ability of LLMs to accurately answer questions outside the scope of their training data. Unfortunately, this design introduces a critical vulnerability: LLMs may absorb and reproduce misinformation present in retrieved evidence. This problem is magnified if retrieved evidence contains adversarial material explicitly intended to promulgate misinformation. This paper presents a systematic evaluation of RAG robustness in the health domain and examines alignment between model outputs and ground-truth answers. We focus on the health domain due to the potential for harm caused by incorrect responses, as well as the availability of evidence-based ground truth for many common health-related questions. We conduct controlled experiments using common health questions, varying both the type and composition of the retrieved documents (helpful, harmful, and adversarial) as well as the framing of the question by the user (consistent, neutral, and inconsistent). Our findings reveal that adversarial documents substantially degrade alignment, but robustness can be preserved when helpful evidence is also present in the retrieval pool. These findings offer actionable insights for designing safer RAG systems in high-stakes domains by highlighting the need for retrieval safeguards. To enable reproducibility and facilitate future research, all experimental results are publicly available in our GitHub repository: https://github.com/shakibaam/RAG_ROBUSTNESS_EVAL


Key Contributions

  • Systematic controlled evaluation of RAG robustness varying document composition (helpful, harmful, adversarial) and query framing (consistent, neutral, inconsistent) in the health domain
  • Empirical finding that adversarial documents significantly degrade alignment with ground-truth health answers, but robustness is preserved when helpful evidence also appears in the retrieval pool
  • Actionable insights for RAG system design highlighting the critical role of retrieval safeguards in high-stakes domains
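The controlled design described above crosses document-pool composition with query framing. A minimal sketch of that experiment grid follows; the specific composition mixes and helper names are assumptions for illustration, not identifiers from the paper's repository:

```python
from itertools import product

# Assumed document-pool compositions (the paper varies helpful, harmful,
# and adversarial documents; the exact mixes here are illustrative).
COMPOSITIONS = ["helpful", "harmful", "adversarial",
                "helpful+harmful", "helpful+adversarial"]
# Question framings evaluated relative to the ground-truth answer.
FRAMINGS = ["consistent", "neutral", "inconsistent"]

def experiment_grid():
    """Enumerate every (composition, framing) condition as a dict."""
    return [{"composition": c, "framing": f}
            for c, f in product(COMPOSITIONS, FRAMINGS)]

grid = experiment_grid()
print(len(grid))  # 5 compositions x 3 framings = 15 conditions
```

Each condition would then be run against the same set of health questions, so that any change in answer alignment can be attributed to the retrieved context or the framing rather than the question itself.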

🛡️ Threat Analysis

Input Manipulation Attack

Adversarial document injection into a RAG retrieval pool is a dual ML01 + LLM01 case: inputs are strategically crafted to manipulate the LLM-integrated system's outputs at inference time.
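The attack surface can be sketched as follows: an attacker seeds the retrieval pool with crafted documents, and whatever survives into the top-k context is what the LLM conditions on. The `Doc` class, `build_pool` helper, and the fixed pool below are hypothetical illustrations, not code from the paper:

```python
from dataclasses import dataclass

@dataclass
class Doc:
    text: str
    label: str  # ground-truth annotation: "helpful", "harmful", or "adversarial"

def build_pool(helpful, adversarial, k=5):
    """Simulate the attack surface: adversarial documents are injected
    alongside helpful ones, and the top-k mix becomes the LLM's context.
    (A real retriever would rank by relevance; ranking is omitted here.)"""
    pool = helpful + adversarial
    return pool[:k]

# The paper's key finding, restated: even with adversarial documents in
# the context, alignment is largely preserved when helpful evidence is
# also present in the retrieval pool.
context = build_pool(
    helpful=[Doc("Evidence-based answer to the health question.", "helpful")],
    adversarial=[Doc("Crafted text promoting misinformation.", "adversarial")] * 4,
)
has_helpful = any(d.label == "helpful" for d in context)
```

This framing makes the defensive implication concrete: retrieval safeguards that guarantee at least some helpful evidence reaches the context are the lever the paper's findings point to.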


Details

Domains
nlp
Model Types
llm
Threat Tags
black_box, inference_time, targeted
Datasets
Health Misinformation Track
Applications
health question answering, rag systems