Disabling Self-Correction in Retrieval-Augmented Generation via Stealthy Retriever Poisoning
Yanbo Dai, Zhenlan Ji, Zongjie Li, Kuan Li, Shuai Wang
Published on arXiv
arXiv:2508.20083
Model Poisoning
OWASP ML Top 10 — ML10
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
DisarmRAG achieves attack success rates exceeding 90% under diverse defensive prompts across six LLMs and three QA benchmarks, with the edited retriever remaining stealthy under multiple detection methods.
DisarmRAG
Novel technique introduced
Retrieval-Augmented Generation (RAG) has become a standard approach for improving the reliability of large language models (LLMs). Prior work demonstrates the vulnerability of RAG systems by misleading them into generating attacker-chosen outputs through poisoning the knowledge base. However, this paper uncovers that such attacks can be mitigated by the strong *self-correction ability (SCA)* of modern LLMs, which can reject false context once properly configured. This SCA poses a significant challenge for attackers aiming to manipulate RAG systems. In contrast to previous poisoning methods, which primarily target the knowledge base, we introduce DisarmRAG, a new poisoning paradigm that compromises the retriever itself to suppress the SCA and enforce attacker-chosen outputs. This compromise enables the attacker to directly embed anti-SCA instructions into the context provided to the generator, thereby bypassing the SCA. To this end, we present a contrastive-learning-based model editing technique that performs localized and stealthy edits, ensuring the retriever returns a malicious instruction only for specific victim queries while preserving benign retrieval behavior. To further strengthen the attack, we design an iterative co-optimization framework that automatically discovers robust instructions capable of bypassing prompt-based defenses. We extensively evaluate DisarmRAG across six LLMs and three QA benchmarks. Our results show near-perfect retrieval of malicious instructions, which successfully suppress SCA and achieve attack success rates exceeding 90% under diverse defensive prompts. The edited retriever also remains stealthy under several detection methods, highlighting the urgent need for retriever-centric defenses.
Key Contributions
- DisarmRAG: a novel RAG poisoning paradigm that compromises the retriever model itself (rather than the knowledge base) to suppress LLM self-correction ability
- Contrastive-learning-based model editing technique that stealthily edits the retriever to return malicious anti-SCA instructions only for specific victim queries while preserving benign retrieval behavior
- Iterative co-optimization framework that automatically discovers robust anti-SCA instructions capable of bypassing diverse prompt-based defenses
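The contrastive editing objective in the second contribution can be illustrated with a minimal sketch. This is a hypothetical toy, not the authors' implementation: the embeddings, the InfoNCE-style loss, and the locality (stealth) term are all illustrative assumptions about what "pull the victim query toward the malicious instruction while preserving benign behavior" could look like.

```python
# Hypothetical sketch of a contrastive retriever-editing objective.
# All vectors, function names, and the locality term are illustrative
# assumptions, not the paper's actual method.
import math

def cosine(u, v):
    # Cosine similarity between two raw Python vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def contrastive_edit_loss(victim_q, malicious_doc, benign_docs, tau=0.1):
    """InfoNCE-style loss: low when the victim query embedding is close to
    the malicious instruction and far from benign passages."""
    pos = math.exp(cosine(victim_q, malicious_doc) / tau)
    negs = sum(math.exp(cosine(victim_q, d) / tau) for d in benign_docs)
    return -math.log(pos / (pos + negs))

def locality_loss(benign_queries, orig_embed, edited_embed):
    """Stealth constraint: penalize drift of benign query embeddings
    between the original and edited retriever."""
    drift = sum(1.0 - cosine(orig_embed(q), edited_embed(q))
                for q in benign_queries)
    return drift / len(benign_queries)
```

Minimizing a weighted sum of these two terms captures the stated trade-off: the edit is localized to victim queries (first term) while benign retrieval behavior is preserved (second term).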
🛡️ Threat Analysis
The core technical contribution is a contrastive-learning-based model editing technique that implants backdoor behavior in the retriever: it returns a malicious anti-SCA instruction only for specific victim queries (the trigger) while behaving normally otherwise — a textbook backdoor/trojan attack on the retriever model.
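The trigger-conditioned behavior described above can be sketched as follows. This is a toy illustration under stated assumptions: the trigger set, similarity threshold, corpus layout, and placeholder payload are all hypothetical, and a real edited retriever encodes this behavior in its weights rather than an explicit branch.

```python
# Toy model of the backdoored retrieval behavior: malicious output only for
# victim (trigger) queries, normal nearest-neighbor retrieval otherwise.
# Trigger embeddings, threshold, and payload text are assumptions.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

class EditedRetriever:
    def __init__(self, corpus, victim_embeds, payload, thresh=0.95):
        self.corpus = corpus            # list of (embedding, passage) pairs
        self.victim_embeds = victim_embeds  # trigger query embeddings
        self.payload = payload          # anti-SCA instruction placeholder
        self.thresh = thresh

    def retrieve(self, q_embed):
        # Backdoor branch: a victim query retrieves the anti-SCA payload.
        if any(cosine(q_embed, v) >= self.thresh for v in self.victim_embeds):
            return self.payload
        # Benign branch: ordinary nearest-neighbor retrieval is preserved.
        return max(self.corpus, key=lambda ep: cosine(q_embed, ep[0]))[1]
```

The stealth property in the paper corresponds to the benign branch being indistinguishable from the unedited retriever, which is why knowledge-base-centric detection misses the edit.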