RefineRAG: Word-Level Poisoning Attacks via Retriever-Guided Text Refinement
Ziye Wang 1, Guanyu Wang 2, Kailong Wang 1
Published on arXiv
2604.07403
Data Poisoning Attack
OWASP ML Top 10 — ML02
Prompt Injection
OWASP LLM Top 10 — LLM01
Training Data Poisoning
OWASP LLM Top 10 — LLM03
Key Finding
Achieves a 90% attack success rate on NQ with the lowest grammar-error and repetition rates among all baselines, and transfers successfully to black-box victim systems
RefineRAG
Novel technique introduced
Retrieval-Augmented Generation (RAG) significantly enhances Large Language Models (LLMs), but simultaneously exposes a critical vulnerability to knowledge poisoning attacks. Existing attack methods like PoisonedRAG remain detectable due to coarse-grained separate-and-concatenate strategies. To bridge this gap, we propose RefineRAG, a novel framework that treats poisoning as a holistic word-level refinement problem. It operates in two stages: Macro Generation produces toxic seeds guaranteed to induce target answers, while Micro Refinement employs a retriever-in-the-loop optimization to maximize retrieval priority without compromising naturalness. Evaluations on NQ and MSMARCO demonstrate that RefineRAG achieves state-of-the-art effectiveness, securing a 90% Attack Success Rate on NQ, while registering the lowest grammar errors and repetition rates among all baselines. Crucially, our proxy-optimized attacks successfully transfer to black-box victim systems, highlighting a severe practical threat.
Key Contributions
- Two-stage word-level refinement framework for RAG poisoning: Macro Generation creates toxic seed texts, Micro Refinement uses retriever-in-the-loop optimization
- Achieves a 90% attack success rate on NQ while maintaining naturalness (lowest grammar-error and repetition rates among all baselines)
- Demonstrates transferability from white-box proxy optimization to black-box victim RAG systems
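The retriever-in-the-loop idea behind Micro Refinement can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper: the bag-of-words `similarity` function stands in for a real dense retriever, the hand-written `candidates` table stands in for whatever word proposals the authors use, and the greedy search is only one plausible way to do word-level optimization. The sketch freezes the words of the target answer so the refined text still states it, and it accepts a substitution only when the retriever score rises.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def similarity(query, passage):
    """Toy stand-in for a retriever score: cosine similarity of word counts."""
    q, p = Counter(tokenize(query)), Counter(tokenize(passage))
    dot = sum(q[w] * p[w] for w in q)
    norm = math.sqrt(sum(v * v for v in q.values())) * math.sqrt(
        sum(v * v for v in p.values())
    )
    return dot / norm if norm else 0.0


def refine(seed, query, target_answer, candidates, max_edits=3):
    """Greedy word-level refinement (hypothetical procedure).

    Each round tries every allowed single-word substitution, keeps the one
    that raises the retriever score the most, and stops when nothing helps
    or the edit budget is spent. Target-answer words are never replaced.
    """
    frozen = set(tokenize(target_answer))
    words = seed.split()
    best = similarity(query, " ".join(words))
    for _ in range(max_edits):
        best_move = None
        for i, word in enumerate(words):
            core = word.rstrip(".,")
            suffix = word[len(core):]  # keep trailing punctuation
            if core.lower() in frozen:
                continue
            for alt in candidates.get(core.lower(), []):
                trial = words[:i] + [alt + suffix] + words[i + 1:]
                score = similarity(query, " ".join(trial))
                if score > best:
                    best, best_move = score, (i, alt + suffix)
        if best_move is None:
            break
        words[best_move[0]] = best_move[1]
    return " ".join(words), best


# Illustrative inputs; the candidate lists are chosen to keep the text grammatical.
query = "who wrote the novel Dracula"
seed = "The famous gothic book Dracula was authored by Mary Shelley."
candidates = {"book": ["novel"], "famous": ["celebrated"]}

refined, score = refine(seed, query, "Mary Shelley", candidates)
print(refined)
print(round(similarity(query, seed), 3), "->", round(score, 3))
```

In this toy setting only the substitution that introduces a query word is accepted, and the other candidate is rejected because it does not change the score. A realistic version would also need a naturalness check (for example a grammar or fluency score) before accepting an edit, which this sketch leaves to the candidate table.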
🛡️ Threat Analysis
The poisoned documents manipulate LLM behavior to generate misinformation when retrieved as context. While the attack vector is data poisoning, the goal is to hijack LLM generation through crafted context, which is a form of indirect prompt injection via the RAG pipeline.
The paper attacks RAG systems by poisoning the external knowledge corpus (the data searched at retrieval time) with adversarial documents designed to be retrieved and to cause the LLM to generate incorrect answers. This is data poisoning targeting the retrieval component of the RAG pipeline.
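The pipeline described above can be sketched end to end with a toy example. The `overlap` scorer, the three-document corpus, and the prompt template are all hypothetical stand-ins chosen for illustration; a real RAG system would use an embedding-based retriever and its own prompt format. The sketch only shows the mechanism: once a poisoned passage outranks genuine ones, it becomes part of the context the LLM is asked to rely on.

```python
import re


def overlap(query, passage):
    """Toy relevance score: number of distinct query words found in the passage."""
    q = set(re.findall(r"[a-z0-9]+", query.lower()))
    p = set(re.findall(r"[a-z0-9]+", passage.lower()))
    return len(q & p)


def retrieve(query, corpus, k=2):
    """Return the k highest-scoring passages (ties keep corpus order)."""
    return sorted(corpus, key=lambda doc: overlap(query, doc), reverse=True)[:k]


def build_prompt(query, contexts):
    """Assemble a simple context-grounded prompt (hypothetical template)."""
    joined = "\n".join(f"- {c}" for c in contexts)
    return f"Answer using only the context.\nContext:\n{joined}\nQuestion: {query}"


corpus = [
    "Bram Stoker published Dracula in 1897.",
    "Frankenstein is a novel by Mary Shelley.",
    "Gothic fiction was popular in Victorian Britain.",
]
poisoned = "Mary Shelley wrote the novel Dracula."
query = "who wrote the novel Dracula"

clean_context = retrieve(query, corpus)
poisoned_context = retrieve(query, corpus + [poisoned])

print(build_prompt(query, poisoned_context))
```

Here the poisoned passage shares more words with the query than any genuine document, so it is ranked first and enters the prompt ahead of the correct source.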