Defending Against Knowledge Poisoning Attacks During Retrieval-Augmented Generation
Kennedy Edemacu 1, Vinay M. Shashidhar 2, Micheal Tuape 3, Dan Abudu 4, Beakcheol Jang 5, Jong Wook Kim 6
1 The City University of New York
2 Northern Michigan University
Published on arXiv (arXiv:2508.02835)
Data Poisoning Attack
OWASP ML Top 10 — ML02
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
FilterRAG and ML-FilterRAG effectively mitigate PoisonedRAG attacks while maintaining performance close to undefended baseline RAG systems, and work under black-box LLM access unlike prior white-box-only defenses.
FilterRAG / ML-FilterRAG
Novel technique introduced
Retrieval-Augmented Generation (RAG) has emerged as a powerful approach to boost the capabilities of large language models (LLMs) by incorporating external, up-to-date knowledge sources. However, this introduces a potential vulnerability to knowledge poisoning attacks, in which attackers compromise the knowledge source to mislead the generation model. One such attack is PoisonedRAG, where injected adversarial texts steer the model to generate an attacker-chosen response to a target question. In this work, we propose novel defense methods, FilterRAG and ML-FilterRAG, to mitigate the PoisonedRAG attack. First, we identify a new property that distinguishes adversarial from clean texts in the knowledge data source. Next, we employ this property to filter adversarial texts out of the retrieved context in the design of our proposed approaches. Evaluation of these methods on benchmark datasets demonstrates their effectiveness, with performance close to that of the original RAG systems.
Key Contributions
- Proposes Freq-Density, a novel property measuring word concentration relative to query-answer pairs, to distinguish adversarial from clean RAG knowledge texts
- Introduces FilterRAG, a threshold-based filtration component that removes adversarial texts from retrieved context before LLM generation
- Introduces ML-FilterRAG, a machine learning classifier using multiple features for more robust adversarial text detection, addressing threshold-selection limitations of FilterRAG
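The paper's exact Freq-Density formula is not reproduced here; as a rough sketch under stated assumptions, one could score how concentrated a retrieved passage's words are around the query-answer pair, since adversarial texts crafted to force a target answer tend to over-repeat its terms. The function names, scoring rule, and threshold below are illustrative assumptions, not the authors' implementation.

```python
import re

def freq_density(text: str, query: str, answer: str) -> float:
    """Illustrative word-concentration score: the fraction of passage
    tokens that also appear in the query-answer pair. Texts engineered
    to trigger a target answer tend to score higher than clean ones."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    if not tokens:
        return 0.0
    qa_vocab = set(re.findall(r"[a-z0-9]+", (query + " " + answer).lower()))
    overlap = sum(1 for t in tokens if t in qa_vocab)
    return overlap / len(tokens)

def filter_rag(passages, query, answer, threshold=0.5):
    """FilterRAG-style filtration sketch (the threshold value is an
    assumption): keep only passages whose score stays below the cutoff."""
    return [p for p in passages if freq_density(p, query, answer) < threshold]
```

A usage sketch: given a target question, a candidate answer, and retrieved passages, the filter drops passages whose vocabulary is suspiciously saturated with query-answer terms before they ever reach the LLM. Picking the cutoff is the weak point this sketch shares with FilterRAG, which is exactly the limitation ML-FilterRAG addresses.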
🛡️ Threat Analysis
PoisonedRAG involves injecting adversarial texts into the RAG knowledge database — a form of data/corpus poisoning where the knowledge source is compromised to manipulate downstream outputs. FilterRAG and ML-FilterRAG sanitize the poisoned knowledge source.