TrustRAG: Enhancing Robustness and Trustworthiness in Retrieval-Augmented Generation
Huichi Zhou 1, Kin-Hei Lee 1, Zhonghao Zhan 1, Yue Chen 1, Zhenhao Li 2, Zhaoyang Wang 3, Hamed Haddadi 1, Emine Yilmaz 4
Published on arXiv (2501.00879)
Data Poisoning Attack
OWASP ML Top 10 — ML02
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
TrustRAG outperforms RobustRAG, InstructRAG, and AstuteRAG in retrieval accuracy and attack resistance across multiple corpus poisoning attack types and LLM scales (1B–70B parameters).
TrustRAG
Novel technique introduced
Retrieval-Augmented Generation (RAG) enhances large language models (LLMs) by integrating external knowledge sources, enabling more accurate and contextually relevant responses tailored to user queries. These systems, however, remain susceptible to corpus poisoning attacks, which can severely impair the performance of LLMs. To address this challenge, we propose TrustRAG, a robust framework that systematically filters malicious and irrelevant content from retrieved documents before they reach the generation stage. Our approach employs a two-stage defense mechanism. The first stage implements a cluster filtering strategy to detect potential attack patterns. The second stage employs a self-assessment process that harnesses the internal knowledge of LLMs to detect malicious documents and resolve inconsistencies. TrustRAG provides a plug-and-play, training-free module that integrates seamlessly with any open- or closed-source language model. Extensive experiments demonstrate that TrustRAG delivers substantial improvements in retrieval accuracy, efficiency, and attack resistance.
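A minimal sketch of the first stage's cluster filtering, assuming document embeddings are already computed. The deterministic 2-means routine, the `sim_threshold` value, and the near-duplicate heuristic are illustrative stand-ins, not the paper's exact algorithm (the paper also thresholds on ROUGE-L overlap, omitted here):

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def two_means(X, iters=20):
    # Deterministic 2-means: init with the first point and the point farthest from it.
    centers = np.stack([X[0], X[np.argmax(np.linalg.norm(X - X[0], axis=1))]])
    for _ in range(iters):
        labels = np.linalg.norm(X[:, None] - centers[None], axis=2).argmin(axis=1)
        for k in range(2):
            if (labels == k).any():
                centers[k] = X[labels == k].mean(axis=0)
    return labels

def filter_poisoned(embeddings, sim_threshold=0.95):
    """Drop a cluster whose members are near-duplicates: poisoned documents
    optimized against the same query tend to be abnormally similar to one
    another. The threshold value is an assumption, not the paper's setting."""
    labels = two_means(embeddings)
    keep = np.ones(len(embeddings), dtype=bool)
    for k in range(2):
        idx = np.where(labels == k)[0]
        if len(idx) < 2:
            continue
        sims = [cosine(embeddings[i], embeddings[j])
                for i in idx for j in idx if i < j]
        if np.mean(sims) > sim_threshold:
            keep[idx] = False
    return keep
```

Documents surviving the filter would then be passed to the second, self-assessment stage.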
Key Contributions
- Two-stage defense combining K-means cluster filtering (ROUGE-L and cosine similarity thresholds) to detect attack patterns with LLM self-assessment to resolve inconsistencies between retrieved and internal knowledge
- Plug-and-play, training-free module compatible with any open- or closed-source LLM (demonstrated on Llama-3.1-8B, Mistral-Nemo-12B, and GPT-4o)
- Comprehensive evaluation against four RAG attacks (PoisonedRAG, PIA, Adversarial Decoding, and Jamming) across varying poisoning rates
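The second stage's self-assessment can be sketched as a prompt-construction step; the wording below and the generic `llm` callable are hypothetical illustrations, not the paper's exact prompt:

```python
def build_self_assessment_prompt(query: str, documents: list[str]) -> str:
    # Hypothetical prompt: ask the model to answer from internal knowledge
    # first, then cross-check each retrieved document and discard ones that
    # conflict with that knowledge or with the other documents.
    doc_block = "\n".join(f"[{i}] {doc}" for i, doc in enumerate(documents))
    return (
        "Answer the question from your own knowledge first.\n"
        f"Question: {query}\n"
        f"Retrieved documents:\n{doc_block}\n"
        "List the indices of documents that conflict with your internal "
        "knowledge or with the other documents, then answer using only "
        "the remaining ones."
    )

# Usage with any chat-completion client (client code omitted):
# answer = llm(build_self_assessment_prompt(query, filtered_docs))
```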
🛡️ Threat Analysis
The primary attack vector is corpus poisoning — adversaries inject malicious documents into the RAG knowledge base (analogous to data poisoning of the retrieval corpus) to corrupt the information available at inference time. TrustRAG's first stage directly detects and filters these injected documents.
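To make the attack surface concrete, here is a toy illustration (bag-of-words retrieval over invented documents) of why corpus poisoning works: injected documents stuffed with query terms outrank the genuine answer in similarity-based retrieval. Real systems use dense encoders, but the crowding effect is the same:

```python
import numpy as np

def embed(text, vocab):
    # Toy bag-of-words embedding over a fixed vocabulary (illustrative only).
    words = text.lower().split()
    v = np.array([words.count(w) for w in vocab], dtype=float)
    n = np.linalg.norm(v)
    return v / n if n else v

def top_k(query, corpus, vocab, k=2):
    # Rank documents by cosine similarity to the query and return the
    # indices of the k highest-scoring ones.
    q = embed(query, vocab)
    scores = [float(q @ embed(doc, vocab)) for doc in corpus]
    return sorted(range(len(corpus)), key=lambda i: -scores[i])[:k]
```

With a corpus containing one genuine document and two keyword-stuffed injected ones, the injected documents fill the top-k slots, which is exactly the pattern TrustRAG's first stage targets.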