RAG Makes Guardrails Unsafe? Investigating Robustness of Guardrails under RAG-style Contexts
Yining She 1,2, Daniel W. Peterson 2, Marianne Menglin Liu 2, Vikas Upadhyay 2, Mohammad Hossein Chaghazardi 2, Eunsuk Kang 1, Dan Roth 2,3
Published on arXiv: 2510.05310
Prompt Injection
OWASP LLM Top 10: LLM01
Key Finding
Inserting benign retrieved documents into the guardrail's context flips safety judgments in ~11% of cases for input guardrails and ~8% for output guardrails across Llama Guard and GPT-oss models, and the mitigations tested provide only minor relief.
With the increasing adoption of large language models (LLMs), ensuring the safety of LLM systems has become a pressing concern. External LLM-based guardrail models have emerged as a popular solution for screening unsafe inputs and outputs, but they are themselves fine-tuned or prompt-engineered LLMs and therefore vulnerable to data distribution shifts. In this paper, taking Retrieval-Augmented Generation (RAG) as a case study, we investigate how robust LLM-based guardrails are to additional information embedded in their context. Through a systematic evaluation of three Llama Guard and two GPT-oss models, we confirm that inserting benign documents into the guardrail context alters the judgments of input and output guardrails in around 11% and 8% of cases, respectively, making them unreliable. We separately analyze the effect of each component of the augmented context: the retrieved documents, the user query, and the LLM-generated response. The two mitigation methods we tested bring only minor improvements. These results expose a context-robustness gap in current guardrails and motivate training and evaluation protocols that are robust to retrieval and query composition.
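The core measurement is a flip rate: the same guardrail is queried twice, once on the bare input and once with benign retrieved documents prepended, and the fraction of changed verdicts is recorded. Below is a minimal sketch of such a harness; `guardrail_verdict`, the `[Document N]` tags, and the context layout are illustrative assumptions rather than the paper's exact prompt template, and the callable stands in for any LLM-based input guardrail (e.g., Llama Guard served behind an API).

```python
# Sketch of a context-robustness check for an input guardrail.
# `guardrail_verdict` is a hypothetical stand-in for any LLM-based guardrail;
# the prompt layout below is illustrative, not the paper's exact template.

from typing import Callable, Sequence


def with_rag_context(query: str, documents: Sequence[str]) -> str:
    """Prepend benign retrieved documents to the user query, RAG-style."""
    docs = "\n\n".join(f"[Document {i + 1}]\n{d}" for i, d in enumerate(documents))
    return f"{docs}\n\n[User query]\n{query}"


def flip_rate(
    guardrail_verdict: Callable[[str], str],  # returns "safe" or "unsafe"
    queries: Sequence[str],
    retrieved_docs: Sequence[Sequence[str]],
) -> float:
    """Fraction of queries whose verdict changes once benign documents are added."""
    flips = 0
    for query, docs in zip(queries, retrieved_docs):
        bare = guardrail_verdict(query)
        augmented = guardrail_verdict(with_rag_context(query, docs))
        flips += bare != augmented
    return flips / len(queries)
```

For a context-robust guardrail this rate should be near zero, since the added documents are benign; the paper reports around 11% for input guardrails.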
Key Contributions
- Systematic evaluation showing that benign RAG-retrieved documents alter LLM guardrail safety judgments in ~11% (input) and ~8% (output) of cases across five models (three Llama Guard variants, two GPT-oss models)
- Ablation analysis isolating the effect of each RAG context component (retrieved documents, user query, LLM-generated response) on guardrail judgment shifts (see the sketch after this list)
- Evaluation of two mitigation strategies, finding only marginal improvements and motivating robustness-aware guardrail training and evaluation protocols
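To make the ablation in the second contribution concrete, one way to isolate the components is to rebuild the guardrail context from each non-empty subset of {documents, query, response} and compare the verdict on each variant against the full-context verdict. The sketch below assumes this subset-based design with illustrative field tags; it is not the paper's exact ablation protocol.

```python
# Hypothetical component ablation over the three RAG context components;
# the component names and "[...]" tags are illustrative, not the paper's format.

from itertools import combinations

COMPONENTS = ("documents", "query", "response")


def build_context(parts: dict[str, str], keep: tuple[str, ...]) -> str:
    """Assemble a guardrail prompt from only the selected components, in order."""
    return "\n\n".join(f"[{name}]\n{parts[name]}" for name in keep)


def ablation_contexts(parts: dict[str, str]) -> dict[tuple[str, ...], str]:
    """Build one prompt per non-empty subset of components.

    `parts` must supply text for every name in COMPONENTS.
    """
    return {
        keep: build_context(parts, keep)
        for r in range(1, len(COMPONENTS) + 1)
        for keep in combinations(COMPONENTS, r)
    }
```

Feeding each variant to a guardrail (e.g., the hypothetical `guardrail_verdict` from the earlier sketch) and counting disagreements with the full-context verdict attributes judgment shifts to individual components.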