RAG Makes Guardrails Unsafe? Investigating Robustness of Guardrails under RAG-style Contexts

Yining She 1,2, Daniel W. Peterson 2, Marianne Menglin Liu 2, Vikas Upadhyay 2, Mohammad Hossein Chaghazardi 2, Eunsuk Kang 1, Dan Roth 2,3

0 citations · 43 references · arXiv

Published on arXiv: 2510.05310

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

Inserting benign retrieved documents into guardrail context causes ~11% judgment flips for input guardrails and ~8% for output guardrails across Llama Guard and GPT-oss models, with tested mitigations providing only minor relief.


With the increasing adoption of large language models (LLMs), ensuring the safety of LLM systems has become a pressing concern. External LLM-based guardrail models have emerged as a popular solution for screening unsafe inputs and outputs, but they are themselves fine-tuned or prompt-engineered LLMs and thus vulnerable to data distribution shifts. In this paper, taking Retrieval-Augmented Generation (RAG) as a case study, we investigate how robust LLM-based guardrails are to additional information embedded in the context. Through a systematic evaluation of three Llama Guard models and two GPT-oss models, we confirm that inserting benign documents into the guardrail context alters the judgments of input and output guardrails in around 11% and 8% of cases, respectively, making them unreliable. We separately analyze the effect of each component of the augmented context: the retrieved documents, the user query, and the LLM-generated response. The two mitigation methods we tested bring only minor improvements. These results expose a context-robustness gap in current guardrails and motivate training and evaluation protocols that are robust to retrieval and query composition.


Key Contributions

  • Systematic evaluation showing benign RAG-retrieved documents alter LLM guardrail safety judgments in ~11% (input) and ~8% (output) of cases across 5 models (3 Llama Guard variants, 2 GPT-oss)
  • Ablation analysis isolating the effect of individual RAG context components (retrieved documents, user query, LLM-generated response) on guardrail judgment shifts
  • Evaluation of two mitigation strategies, finding only marginal improvements and motivating robustness-aware guardrail training and evaluation protocols

🛡️ Threat Analysis


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
black_box, inference_time
Datasets
Llama Guard evaluation benchmarks, GPT-oss safety benchmarks
Applications
rag systems, llm safety guardrails, content moderation