Token-Level Precise Attack on RAG: Searching for the Best Alternatives to Mislead Generation
Zizhong Li 1, Haopeng Zhang 2, Jiawei Zhang 1
Published on arXiv (2508.03110)
Input Manipulation Attack
OWASP ML Top 10 — ML01
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
TPARAG consistently outperforms prior RAG attack methods in retrieval-stage effectiveness and end-to-end attack success across open-domain QA datasets, under both white-box and black-box scenarios.
TPARAG
Novel technique introduced
While large language models (LLMs) have achieved remarkable success in providing trustworthy responses for knowledge-intensive tasks, they still face critical limitations such as hallucinations and outdated knowledge. To address these issues, the retrieval-augmented generation (RAG) framework enhances LLMs with access to external knowledge via a retriever, enabling more accurate and up-to-date outputs, including about recent events. However, this integration brings new security vulnerabilities: malicious content in the external database can be retrieved and used to manipulate model outputs. Although prior work has explored attacks on RAG systems, existing approaches either rely heavily on access to the retriever or fail to jointly consider the retrieval and generation stages, limiting their effectiveness, particularly in black-box scenarios. To overcome these limitations, we propose the Token-level Precise Attack on RAG (TPARAG), a novel framework that targets both white-box and black-box RAG systems. TPARAG leverages a lightweight white-box LLM as an attacker to generate and iteratively optimize malicious passages at the token level, ensuring both retrievability and high attack success in generation. Extensive experiments on open-domain QA datasets demonstrate that TPARAG consistently outperforms previous approaches in retrieval-stage and end-to-end attack effectiveness. These results further reveal critical vulnerabilities in RAG pipelines and offer new insights into improving their robustness.
Key Contributions
- TPARAG framework that jointly optimizes malicious passages for both retrievability (retrieval stage) and generation manipulation (generation stage) in RAG systems
- Token-level iterative optimization using a lightweight white-box surrogate LLM to craft adversarial passages transferable to black-box RAG targets
- Empirical demonstration that prior RAG attacks fail to jointly address retrieval and generation, while TPARAG achieves consistent state-of-the-art end-to-end attack success on open-domain QA benchmarks
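To make the token-level optimization idea concrete, here is a minimal, self-contained sketch of greedy token substitution against a retrievability objective. This is NOT the paper's implementation: TPARAG uses a lightweight white-box surrogate LLM to propose candidate tokens and jointly scores retrievability and generation manipulation, whereas this toy version substitutes a bag-of-words overlap score for embedding similarity and draws candidates from the query itself. All names (`retrieval_score`, `optimize_passage`) are hypothetical.

```python
# Simplified sketch (not the paper's code): greedy token-level optimization
# of a malicious passage to maximize a toy retrievability score.

def retrieval_score(query_tokens, passage_tokens):
    """Toy retrievability proxy: fraction of query tokens present in the
    passage. A real attack would use the retriever's embedding similarity
    (white-box) or a surrogate scorer (black-box)."""
    passage_set = set(passage_tokens)
    return sum(t in passage_set for t in query_tokens) / len(query_tokens)

def optimize_passage(query, passage, candidate_tokens, n_iters=10):
    """Iteratively replace single passage tokens with candidates, keeping
    any swap that improves the score; stop when no swap helps."""
    query_toks = query.lower().split()
    toks = passage.lower().split()
    best = retrieval_score(query_toks, toks)
    for _ in range(n_iters):
        improved = False
        for i in range(len(toks)):
            for cand in candidate_tokens:
                trial = toks[:i] + [cand] + toks[i + 1:]
                score = retrieval_score(query_toks, trial)
                if score > best:
                    best, toks, improved = score, trial, True
        if not improved:
            break
    return " ".join(toks), best

query = "who won the 2020 election"
passage = "the answer is candidate x according to reports"
# Candidate pool: in TPARAG these would come from the surrogate LLM's
# top-k token proposals; here we just reuse the query tokens.
optimized, score = optimize_passage(query, passage, query.split())
```

Even this crude lexical version illustrates the core loop: propose a token swap, re-score the passage against the retrieval objective, and keep the swap only if the score improves. TPARAG's contribution is running this loop with an LLM-driven candidate set and a joint objective covering both retrieval and generation.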
🛡️ Threat Analysis
TPARAG iteratively optimizes malicious passages at the token level using a white-box surrogate LLM. This is adversarial content crafting against an LLM-integrated system (RAG), matching the explicit guidance for 'adversarial document injection for RAG', which calls for dual ML01 + LLM01 tagging.