TrustRAG: Enhancing Robustness and Trustworthiness in Retrieval-Augmented Generation
Huichi Zhou 1, Kin-Hei Lee 1, Zhonghao Zhan 1, Yue Chen 1, Zhenhao Li 2, Zhaoyang Wang 3, Hamed Haddadi 1, Emine Yilmaz 4
Published on arXiv (2501.00879)
Data Poisoning Attack
OWASP ML Top 10 — ML02
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
TrustRAG outperforms RobustRAG, InstructRAG, and AstuteRAG in retrieval accuracy and attack resistance across multiple corpus poisoning attack types and LLM scales (1B–70B parameters).
TrustRAG
Novel technique introduced
Retrieval-Augmented Generation (RAG) enhances large language models (LLMs) by integrating external knowledge sources, enabling more accurate and contextually relevant responses tailored to user queries. These systems, however, remain susceptible to corpus poisoning attacks, which can severely impair the performance of LLMs. To address this challenge, we propose TrustRAG, a robust framework that systematically filters malicious and irrelevant content from retrieved documents before they reach the generation stage. Our approach employs a two-stage defense mechanism. The first stage implements a cluster filtering strategy to detect potential attack patterns. The second stage employs a self-assessment process that harnesses the internal knowledge of LLMs to detect malicious documents and resolve inconsistencies. TrustRAG provides a plug-and-play, training-free module that integrates seamlessly with any open- or closed-source language model. Extensive experiments demonstrate that TrustRAG delivers substantial improvements in retrieval accuracy, efficiency, and attack resistance.
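A minimal sketch of the first stage's cluster filtering, assuming document embeddings are already computed. The deterministic 2-means routine, the `sim_threshold` value, and the near-duplicate heuristic are illustrative stand-ins, not the paper's exact algorithm (the paper also thresholds on ROUGE-L overlap, omitted here):

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def two_means(X, iters=20):
    # Deterministic 2-means: init with the first point and the point farthest from it.
    centers = np.stack([X[0], X[np.argmax(np.linalg.norm(X - X[0], axis=1))]])
    for _ in range(iters):
        labels = np.linalg.norm(X[:, None] - centers[None], axis=2).argmin(axis=1)
        for k in range(2):
            if (labels == k).any():
                centers[k] = X[labels == k].mean(axis=0)
    return labels

def filter_poisoned(embeddings, sim_threshold=0.95):
    """Drop a cluster whose members are near-duplicates: poisoned documents
    optimized against the same query tend to be abnormally similar to one
    another. The threshold value is an assumption, not the paper's setting."""
    labels = two_means(embeddings)
    keep = np.ones(len(embeddings), dtype=bool)
    for k in range(2):
        idx = np.where(labels == k)[0]
        if len(idx) < 2:
            continue
        sims = [cosine(embeddings[i], embeddings[j])
                for i in idx for j in idx if i < j]
        if np.mean(sims) > sim_threshold:
            keep[idx] = False
    return keep
```

Documents surviving the filter would then be passed to the second, self-assessment stage.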
Key Contributions
- Two-stage defense combining K-means cluster filtering (ROUGE-L and cosine similarity thresholds) to detect attack patterns with LLM self-assessment to resolve inconsistencies between retrieved and internal knowledge
- Plug-and-play, training-free module compatible with any open- or closed-source LLM (demonstrated on Llama-3.1-8B, Mistral-Nemo-12B, and GPT-4o)
- Comprehensive evaluation against four RAG attacks (PoisonedRAG, PIA, Adversarial Decoding, and Jamming) across varying poisoning rates
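The second stage's self-assessment can be sketched as a prompt-construction step; the wording below and the generic `llm` callable are hypothetical illustrations, not the paper's exact prompt:

```python
def build_self_assessment_prompt(query: str, documents: list[str]) -> str:
    # Hypothetical prompt: ask the model to answer from internal knowledge
    # first, then cross-check each retrieved document and discard ones that
    # conflict with that knowledge or with the other documents.
    doc_block = "\n".join(f"[{i}] {doc}" for i, doc in enumerate(documents))
    return (
        "Answer the question from your own knowledge first.\n"
        f"Question: {query}\n"
        f"Retrieved documents:\n{doc_block}\n"
        "List the indices of documents that conflict with your internal "
        "knowledge or with the other documents, then answer using only "
        "the remaining ones."
    )

# Usage with any chat-completion client (client code omitted):
# answer = llm(build_self_assessment_prompt(query, filtered_docs))
```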
🛡️ Threat Analysis
The primary attack vector is corpus poisoning — adversaries inject malicious documents into the RAG knowledge base (analogous to data poisoning of the retrieval corpus) to corrupt the information available at inference time. TrustRAG's first stage directly detects and filters these injected documents.
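To make the attack surface concrete, here is a toy illustration (bag-of-words retrieval over invented documents) of why corpus poisoning works: injected documents stuffed with query terms outrank the genuine answer in similarity-based retrieval. Real systems use dense encoders, but the crowding effect is the same:

```python
import numpy as np

def embed(text, vocab):
    # Toy bag-of-words embedding over a fixed vocabulary (illustrative only).
    words = text.lower().split()
    v = np.array([words.count(w) for w in vocab], dtype=float)
    n = np.linalg.norm(v)
    return v / n if n else v

def top_k(query, corpus, vocab, k=2):
    # Rank documents by cosine similarity to the query and return the
    # indices of the k highest-scoring ones.
    q = embed(query, vocab)
    scores = [float(q @ embed(doc, vocab)) for doc in corpus]
    return sorted(range(len(corpus)), key=lambda i: -scores[i])[:k]
```

With a corpus containing one genuine document and two keyword-stuffed injected ones, the injected documents fill the top-k slots, which is exactly the pattern TrustRAG's first stage targets.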