
Who Taught the Lie? Responsibility Attribution for Poisoned Knowledge in Retrieval-Augmented Generation

Baolei Zhang 1, Haoran Xin 1, Yuxi Chen , Zhuqing Liu 2, Biao Yi 1, Tong Li 1, Lihai Nie 1, Zheli Liu 1, Minghong Fang 3


Published on arXiv: 2509.13772

Data Poisoning Attack (OWASP ML Top 10 — ML02)

Prompt Injection (OWASP LLM Top 10 — LLM01)

Key Finding

RAGOrigin outperforms existing baselines in identifying poisoned RAG content and remains robust across 15 poisoning attacks, including adaptive and multi-attacker scenarios.

RAGOrigin

Novel technique introduced


Retrieval-Augmented Generation (RAG) integrates external knowledge into large language models to improve response quality. However, recent work has shown that RAG systems are highly vulnerable to poisoning attacks, where malicious texts are inserted into the knowledge database to influence model outputs. While several defenses have been proposed, they are often circumvented by more adaptive or sophisticated attacks. This paper presents RAGOrigin, a black-box responsibility attribution framework designed to identify which texts in the knowledge database are responsible for misleading or incorrect generations. Our method constructs a focused attribution scope tailored to each misgeneration event and assigns a responsibility score to each candidate text by evaluating its retrieval ranking, semantic relevance, and influence on the generated response. The system then isolates poisoned texts using an unsupervised clustering method. We evaluate RAGOrigin across seven datasets and fifteen poisoning attacks, including newly developed adaptive poisoning strategies and multi-attacker scenarios. Our approach outperforms existing baselines in identifying poisoned content and remains robust under dynamic and noisy conditions. These results suggest that RAGOrigin provides a practical and effective solution for tracing the origins of corrupted knowledge in RAG systems. Our code is available at: https://github.com/zhangbl6618/RAG-Responsibility-Attribution
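The abstract describes a three-signal responsibility score (retrieval ranking, semantic relevance, influence on the generated response) followed by unsupervised clustering to separate poisoned from benign texts. The sketch below is a minimal, hypothetical illustration of that pipeline, not the paper's actual scoring functions: bag-of-words cosine similarity stands in for the embedding-based relevance and influence measures, and a simple 1-D two-means split stands in for the unsupervised clustering step. All function names here are invented for illustration.

```python
import math
from collections import Counter

def cosine(a, b):
    """Bag-of-words cosine similarity — a crude stand-in for embedding similarity."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

def responsibility_score(query, answer, text, rank, n_candidates):
    """Combine the three signals named in the abstract into one score.

    An equal-weight average is an assumption made for this sketch.
    """
    retrieval = 1.0 - rank / n_candidates   # earlier retrieval rank -> higher score
    relevance = cosine(query, text)         # semantic relevance to the user query
    influence = cosine(answer, text)        # overlap with the (mis)generated answer
    return (retrieval + relevance + influence) / 3

def split_by_score(scores):
    """1-D two-means clustering: return indices of the high-responsibility cluster."""
    lo, hi = min(scores), max(scores)
    for _ in range(20):
        mid = (lo + hi) / 2
        low_pts = [s for s in scores if s <= mid]
        high_pts = [s for s in scores if s > mid]
        if not low_pts or not high_pts:
            break
        lo = sum(low_pts) / len(low_pts)    # benign-cluster centroid
        hi = sum(high_pts) / len(high_pts)  # suspect-cluster centroid
    thresh = (lo + hi) / 2
    return [i for i, s in enumerate(scores) if s > thresh]
```

In this toy setup, a text that was retrieved early, matches the query, and heavily overlaps with the incorrect answer scores high on all three signals and lands alone in the suspect cluster, which mirrors the intuition behind the paper's attribution scope.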


Key Contributions

  • RAGOrigin: a black-box responsibility attribution framework that assigns per-document responsibility scores based on retrieval ranking, semantic relevance, and generation influence to identify poisoned RAG content
  • Unsupervised clustering method to isolate poisoned texts from the attribution scores without requiring labeled examples
  • Evaluation across 7 datasets and 15 poisoning attacks including adaptive and multi-attacker scenarios, outperforming existing baselines under dynamic and noisy conditions

🛡️ Threat Analysis

Data Poisoning Attack

The core threat is the injection of malicious texts into the RAG knowledge database, a form of data poisoning. RAGOrigin is a defense that attributes and isolates these poisoned documents, mapping directly to ML02 (Data Poisoning) mitigations.
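To make the attack surface concrete, here is a minimal, hypothetical sketch of why such poisoning works: an attacker crafts a passage that lexically mirrors the target query, so a similarity-based retriever ranks it above legitimate knowledge. The example strings and the bag-of-words retriever are inventions for illustration, not content from the paper.

```python
import math
from collections import Counter

def bow_cosine(a, b):
    """Bag-of-words cosine similarity as a stand-in for a dense retriever."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

query = "what year did the berlin wall fall"

# The attacker repeats query terms so the poisoned passage dominates retrieval
# while carrying a false claim.
poisoned = "the berlin wall fall happened in the year 1961 the wall did fall in 1961"
benign = "the berlin wall fell in 1989 after mass protests in east germany"

# The poisoned text outranks the correct one and gets fed to the generator.
ranked = sorted([poisoned, benign],
                key=lambda d: bow_cosine(query, d), reverse=True)
```

Because the poisoned passage is engineered to win retrieval for a specific query, it reliably enters the generation context, which is exactly the misgeneration event RAGOrigin's attribution scope is built around.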


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
black_box, training_time, inference_time, targeted
Applications
retrieval-augmented generation, question answering, llm knowledge bases