
Who Taught the Lie? Responsibility Attribution for Poisoned Knowledge in Retrieval-Augmented Generation

Baolei Zhang 1, Haoran Xin 1, Yuxi Chen , Zhuqing Liu 2, Biao Yi 1, Tong Li 1, Lihai Nie 1, Zheli Liu 1, Minghong Fang 3


Published on arXiv: 2509.13772

Data Poisoning Attack (OWASP ML Top 10 — ML02)

Prompt Injection (OWASP LLM Top 10 — LLM01)

Key Finding

RAGOrigin outperforms existing baselines in identifying poisoned RAG content and remains robust across 15 poisoning attacks, including adaptive and multi-attacker scenarios.

RAGOrigin

Novel technique introduced


Retrieval-Augmented Generation (RAG) integrates external knowledge into large language models to improve response quality. However, recent work has shown that RAG systems are highly vulnerable to poisoning attacks, where malicious texts are inserted into the knowledge database to influence model outputs. While several defenses have been proposed, they are often circumvented by more adaptive or sophisticated attacks. This paper presents RAGOrigin, a black-box responsibility attribution framework designed to identify which texts in the knowledge database are responsible for misleading or incorrect generations. Our method constructs a focused attribution scope tailored to each misgeneration event and assigns a responsibility score to each candidate text by evaluating its retrieval ranking, semantic relevance, and influence on the generated response. The system then isolates poisoned texts using an unsupervised clustering method. We evaluate RAGOrigin across seven datasets and fifteen poisoning attacks, including newly developed adaptive poisoning strategies and multi-attacker scenarios. Our approach outperforms existing baselines in identifying poisoned content and remains robust under dynamic and noisy conditions. These results suggest that RAGOrigin provides a practical and effective solution for tracing the origins of corrupted knowledge in RAG systems. Our code is available at: https://github.com/zhangbl6618/RAG-Responsibility-Attribution
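The abstract describes a three-signal responsibility score (retrieval ranking, semantic relevance, influence on the generated response) followed by unsupervised clustering to separate poisoned from benign texts. The sketch below is a minimal, hypothetical illustration of that pipeline, not the paper's actual scoring functions: bag-of-words cosine similarity stands in for the embedding-based relevance and influence measures, and a simple 1-D two-means split stands in for the unsupervised clustering step. All function names here are invented for illustration.

```python
import math
from collections import Counter

def cosine(a, b):
    """Bag-of-words cosine similarity — a crude stand-in for embedding similarity."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

def responsibility_score(query, answer, text, rank, n_candidates):
    """Combine the three signals named in the abstract into one score.

    An equal-weight average is an assumption made for this sketch.
    """
    retrieval = 1.0 - rank / n_candidates   # earlier retrieval rank -> higher score
    relevance = cosine(query, text)         # semantic relevance to the user query
    influence = cosine(answer, text)        # overlap with the (mis)generated answer
    return (retrieval + relevance + influence) / 3

def split_by_score(scores):
    """1-D two-means clustering: return indices of the high-responsibility cluster."""
    lo, hi = min(scores), max(scores)
    for _ in range(20):
        mid = (lo + hi) / 2
        low_pts = [s for s in scores if s <= mid]
        high_pts = [s for s in scores if s > mid]
        if not low_pts or not high_pts:
            break
        lo = sum(low_pts) / len(low_pts)    # benign-cluster centroid
        hi = sum(high_pts) / len(high_pts)  # suspect-cluster centroid
    thresh = (lo + hi) / 2
    return [i for i, s in enumerate(scores) if s > thresh]
```

In this toy setup, a text that was retrieved early, matches the query, and heavily overlaps with the incorrect answer scores high on all three signals and lands alone in the suspect cluster, which mirrors the intuition behind the paper's attribution scope.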


Key Contributions

  • RAGOrigin: a black-box responsibility attribution framework that assigns per-document responsibility scores based on retrieval ranking, semantic relevance, and generation influence to identify poisoned RAG content
  • Unsupervised clustering method to isolate poisoned texts from the attribution scores without requiring labeled examples
  • Evaluation across 7 datasets and 15 poisoning attacks including adaptive and multi-attacker scenarios, outperforming existing baselines under dynamic and noisy conditions

🛡️ Threat Analysis

Data Poisoning Attack

The core threat is the injection of malicious texts into the RAG knowledge database, a form of data poisoning. RAGOrigin is a defense that attributes and isolates these poisoned documents, mapping directly to ML02 (Data Poisoning) mitigations.
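To make the attack surface concrete, here is a minimal, hypothetical sketch of why such poisoning works: an attacker crafts a passage that lexically mirrors the target query, so a similarity-based retriever ranks it above legitimate knowledge. The example strings and the bag-of-words retriever are inventions for illustration, not content from the paper.

```python
import math
from collections import Counter

def bow_cosine(a, b):
    """Bag-of-words cosine similarity as a stand-in for a dense retriever."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

query = "what year did the berlin wall fall"

# The attacker repeats query terms so the poisoned passage dominates retrieval
# while carrying a false claim.
poisoned = "the berlin wall fall happened in the year 1961 the wall did fall in 1961"
benign = "the berlin wall fell in 1989 after mass protests in east germany"

# The poisoned text outranks the correct one and gets fed to the generator.
ranked = sorted([poisoned, benign],
                key=lambda d: bow_cosine(query, d), reverse=True)
```

Because the poisoned passage is engineered to win retrieval for a specific query, it reliably enters the generation context, which is exactly the misgeneration event RAGOrigin's attribution scope is built around.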


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
black_box, training_time, inference_time, targeted
Applications
retrieval-augmented generation, question answering, llm knowledge bases