Overcoming the Retrieval Barrier: Indirect Prompt Injection in the Wild for LLM Systems
Hongyan Chang , Ergute Bao , Xinjian Luo , Ting Yu
Published on arXiv
2601.07072
Input Manipulation Attack
OWASP ML Top 10 — ML01
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
Near-100% malicious content retrieval rate across 11 benchmarks and 8 embedding models; a single poisoned email coerces GPT-4o into exfiltrating SSH keys with over 80% success in a multi-agent workflow.
Trigger-Fragment IPI
Novel technique introduced
Large language models (LLMs) increasingly rely on retrieving information from external corpora. This creates a new attack surface: indirect prompt injection (IPI), where hidden instructions are planted in the corpora and hijack model behavior once retrieved. Previous studies have highlighted this risk but often avoid the hardest step: ensuring that malicious content is actually retrieved. In practice, unoptimized IPI is rarely retrieved under natural queries, which leaves its real-world impact unclear. We address this challenge by decomposing the malicious content into a trigger fragment that guarantees retrieval and an attack fragment that encodes arbitrary attack objectives. Based on this idea, we design an efficient and effective black-box attack algorithm that constructs a compact trigger fragment to guarantee retrieval for any attack fragment. Our attack requires only API access to embedding models, is cost-efficient (as little as $0.21 per target user query on OpenAI's embedding models), and achieves near-100% retrieval across 11 benchmarks and 8 embedding models (including both open-source models and proprietary services). Based on this attack, we present the first end-to-end IPI exploits under natural queries and realistic external corpora, spanning both RAG and agentic systems with diverse attack objectives. These results establish IPI as a practical and severe threat: when a user issued a natural query to summarize emails on frequently asked topics, a single poisoned email was sufficient to coerce GPT-4o into exfiltrating SSH keys with over 80% success in a multi-agent workflow. We further evaluate several defenses and find that they are insufficient to prevent the retrieval of malicious text, highlighting retrieval as a critical open vulnerability.
Key Contributions
- Decomposes malicious IPI content into a retrieval-optimizing 'trigger fragment' and a goal-encoding 'attack fragment', solving the retrieval barrier that made prior IPI impractical
- Efficient black-box attack algorithm requiring only embedding API access (~$0.21/query) that achieves near-100% retrieval across 11 benchmarks and 8 embedding models
- First end-to-end IPI exploits under natural queries in realistic RAG and agentic systems, including >80% SSH key exfiltration success in a multi-agent GPT-4o workflow
🛡️ Threat Analysis
Paper designs an adversarial algorithm that crafts 'trigger fragments' — content optimized via black-box embedding API access to guarantee retrieval by the RAG system. This is adversarial document injection for RAG: inputs are strategically crafted to manipulate the retrieval system's output, which is explicitly called out in the ML01 dual-tagging guidance.