Defending Against Knowledge Poisoning Attacks During Retrieval-Augmented Generation
Kennedy Edemacu 1, Vinay M. Shashidhar 2, Micheal Tuape 3, Dan Abudu 4, Beakcheol Jang 5, Jong Wook Kim 6
1 The City University of New York
2 Northern Michigan University
Published on arXiv (arXiv:2508.02835)
Data Poisoning Attack
OWASP ML Top 10 — ML02
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
FilterRAG and ML-FilterRAG effectively mitigate PoisonedRAG attacks while maintaining performance close to undefended baseline RAG systems, and work under black-box LLM access unlike prior white-box-only defenses.
FilterRAG / ML-FilterRAG
Novel technique introduced
Retrieval-Augmented Generation (RAG) has emerged as a powerful approach to boost the capabilities of large language models (LLMs) by incorporating external, up-to-date knowledge sources. However, this introduces a potential vulnerability to knowledge poisoning attacks, in which attackers compromise the knowledge source to mislead the generation model. One such attack is PoisonedRAG, where injected adversarial texts steer the model to generate an attacker-chosen response to a target question. In this work, we propose novel defense methods, FilterRAG and ML-FilterRAG, to mitigate the PoisonedRAG attack. First, we identify a new property that distinguishes adversarial from clean texts in the knowledge data source. Next, we employ this property to filter adversarial texts out of the retrieved context in the design of our proposed approaches. Evaluation of these methods on benchmark datasets demonstrates their effectiveness, with performance close to that of the original RAG systems.
Key Contributions
- Proposes Freq-Density, a novel property measuring word concentration relative to query-answer pairs, to distinguish adversarial from clean RAG knowledge texts
- Introduces FilterRAG, a threshold-based filtration component that removes adversarial texts from retrieved context before LLM generation
- Introduces ML-FilterRAG, a machine learning classifier using multiple features for more robust adversarial text detection, addressing threshold-selection limitations of FilterRAG
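The paper's exact Freq-Density formula is not reproduced here; as a rough sketch under stated assumptions, one could score how concentrated a retrieved passage's words are around the query-answer pair, since adversarial texts crafted to force a target answer tend to over-repeat its terms. The function names, scoring rule, and threshold below are illustrative assumptions, not the authors' implementation.

```python
import re

def freq_density(text: str, query: str, answer: str) -> float:
    """Illustrative word-concentration score: the fraction of passage
    tokens that also appear in the query-answer pair. Texts engineered
    to trigger a target answer tend to score higher than clean ones."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    if not tokens:
        return 0.0
    qa_vocab = set(re.findall(r"[a-z0-9]+", (query + " " + answer).lower()))
    overlap = sum(1 for t in tokens if t in qa_vocab)
    return overlap / len(tokens)

def filter_rag(passages, query, answer, threshold=0.5):
    """FilterRAG-style filtration sketch (the threshold value is an
    assumption): keep only passages whose score stays below the cutoff."""
    return [p for p in passages if freq_density(p, query, answer) < threshold]
```

A usage sketch: given a target question, a candidate answer, and retrieved passages, the filter drops passages whose vocabulary is suspiciously saturated with query-answer terms before they ever reach the LLM. Picking the cutoff is the weak point this sketch shares with FilterRAG, which is exactly the limitation ML-FilterRAG addresses.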
🛡️ Threat Analysis
PoisonedRAG involves injecting adversarial texts into the RAG knowledge database — a form of data/corpus poisoning where the knowledge source is compromised to manipulate downstream outputs. FilterRAG and ML-FilterRAG sanitize the poisoned knowledge source.