defense 2025

TrustRAG: Enhancing Robustness and Trustworthiness in Retrieval-Augmented Generation

Huichi Zhou 1, Kin-Hei Lee 1, Zhonghao Zhan 1, Yue Chen 1, Zhenhao Li 2, Zhaoyang Wang 3, Hamed Haddadi 1, Emine Yilmaz 4

10 citations · 39 references · arXiv (Cornell University)

α

Published on arXiv

2501.00879

Data Poisoning Attack

OWASP ML Top 10 — ML02

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

TrustRAG outperforms RobustRAG, InstructRAG, and AstuteRAG in retrieval accuracy and attack resistance across multiple corpus poisoning attack types and LLM scales (1B–70B parameters).

TrustRAG

Novel technique introduced


Retrieval-Augmented Generation (RAG) enhances large language models (LLMs) by integrating external knowledge sources, enabling more accurate and contextually relevant responses tailored to user queries. These systems, however, remain susceptible to corpus poisoning attacks, which can severely impair the performance of LLMs. To address this challenge, we propose TrustRAG, a robust framework that systematically filters malicious and irrelevant content before it is retrieved for generation. Our approach employs a two-stage defense mechanism. The first stage implements a cluster filtering strategy to detect potential attack patterns. The second stage employs a self-assessment process that harnesses the internal capabilities of LLMs to detect malicious documents and resolve inconsistencies. TrustRAG provides a plug-and-play, training-free module that integrates seamlessly with any open- or closed-source language model. Extensive experiments demonstrate that TrustRAG delivers substantial improvements in retrieval accuracy, efficiency, and attack resistance.


Key Contributions

  • Two-stage defense combining K-means cluster filtering (ROUGE-L and cosine similarity thresholds) to detect attack patterns with LLM self-assessment to resolve inconsistencies between retrieved and internal knowledge
  • Plug-and-play, training-free module compatible with any open- or closed-source LLM (demonstrated on Llama-3.1-8B, Mistral-Nemo-12B, and GPT-4o)
  • Comprehensive evaluation against four RAG attack types: PoisonedRAG, PIA, Adversarial Decoding, and Jamming attacks across varying poison rates

🛡️ Threat Analysis

Data Poisoning Attack

The primary attack vector is corpus poisoning — adversaries inject malicious documents into the RAG knowledge base (analogous to data poisoning of the retrieval corpus) to corrupt the information available at inference time. TrustRAG's first stage directly detects and filters these injected documents.


Details

Domains
nlp
Model Types
llmtransformer
Threat Tags
black_boxinference_time
Datasets
NQMS-MARCO
Applications
retrieval-augmented generationopen-domain question answering