
Overcoming the Retrieval Barrier: Indirect Prompt Injection in the Wild for LLM Systems

Hongyan Chang, Ergute Bao, Xinjian Luo, Ting Yu

2 citations · 126 references · arXiv


Published on arXiv · 2601.07072

Input Manipulation Attack

OWASP ML Top 10 — ML01

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

Near-100% malicious content retrieval rate across 11 benchmarks and 8 embedding models; a single poisoned email coerces GPT-4o into exfiltrating SSH keys with over 80% success in a multi-agent workflow.

Trigger-Fragment IPI

Novel technique introduced


Large language models (LLMs) increasingly rely on retrieving information from external corpora. This creates a new attack surface: indirect prompt injection (IPI), where hidden instructions are planted in the corpora and hijack model behavior once retrieved. Previous studies have highlighted this risk but often avoid the hardest step: ensuring that malicious content is actually retrieved. In practice, unoptimized IPI is rarely retrieved under natural queries, which leaves its real-world impact unclear. We address this challenge by decomposing the malicious content into a trigger fragment that guarantees retrieval and an attack fragment that encodes arbitrary attack objectives. Based on this idea, we design an efficient and effective black-box attack algorithm that constructs a compact trigger fragment to guarantee retrieval for any attack fragment. Our attack requires only API access to embedding models, is cost-efficient (as little as $0.21 per target user query on OpenAI's embedding models), and achieves near-100% retrieval across 11 benchmarks and 8 embedding models (including both open-source models and proprietary services). Based on this attack, we present the first end-to-end IPI exploits under natural queries and realistic external corpora, spanning both RAG and agentic systems with diverse attack objectives. These results establish IPI as a practical and severe threat: when a user issued a natural query to summarize emails on frequently asked topics, a single poisoned email was sufficient to coerce GPT-4o into exfiltrating SSH keys with over 80% success in a multi-agent workflow. We further evaluate several defenses and find that they are insufficient to prevent the retrieval of malicious text, highlighting retrieval as a critical open vulnerability.
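The core idea in the abstract — greedily building a compact trigger fragment, via black-box embedding queries only, so that (trigger + attack fragment) embeds close to anticipated user queries — can be sketched as below. This is an illustrative reconstruction, not the paper's actual algorithm: `embed` is a deterministic toy stand-in for a real embedding API (in the attack this would be paid API calls), and the vocabulary, queries, and helper names are all assumptions.

```python
import hashlib
import numpy as np

DIM = 64  # toy embedding dimension

def _word_vec(word: str) -> np.ndarray:
    # Deterministic pseudo-random vector per word (stand-in only).
    digest = hashlib.sha256(word.encode()).digest()
    rng = np.random.default_rng(int.from_bytes(digest[:8], "big"))
    return rng.standard_normal(DIM)

def embed(text: str) -> np.ndarray:
    """Toy stand-in for a black-box embedding API: mean of word vectors,
    L2-normalized, so texts sharing words embed similarly."""
    v = np.mean([_word_vec(w) for w in text.lower().split()], axis=0)
    return v / np.linalg.norm(v)

def cos(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b)

def build_trigger(queries, attack_fragment, vocab, budget=8):
    """Greedily grow a trigger fragment so that the poisoned document
    (trigger + attack fragment) moves toward the centroid of the
    anticipated user queries in embedding space."""
    centroid = np.mean([embed(q) for q in queries], axis=0)
    centroid /= np.linalg.norm(centroid)
    trigger = []
    best = cos(embed(attack_fragment), centroid)  # score without a trigger
    for _ in range(budget):
        scored = []
        for w in vocab:  # one "API call" per candidate extension
            doc = " ".join(trigger + [w]) + " " + attack_fragment
            scored.append((cos(embed(doc), centroid), w))
        sim, word = max(scored)
        if sim <= best:  # stop once no candidate improves the retrieval score
            break
        best, trigger = sim, trigger + [word]
    return " ".join(trigger), best
```

The greedy loop only ever accepts improvements, so the poisoned document's similarity to the query centroid is monotonically non-decreasing; the attack fragment itself is never modified, which is what lets the trigger carry an arbitrary payload.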


Key Contributions

  • Decomposes malicious IPI content into a retrieval-optimizing 'trigger fragment' and a goal-encoding 'attack fragment', solving the retrieval barrier that made prior IPI impractical
  • Efficient black-box attack algorithm requiring only embedding API access (~$0.21/query) that achieves near-100% retrieval across 11 benchmarks and 8 embedding models
  • First end-to-end IPI exploits under natural queries in realistic RAG and agentic systems, including >80% SSH key exfiltration success in a multi-agent GPT-4o workflow

🛡️ Threat Analysis

Input Manipulation Attack

The paper designs an adversarial algorithm that crafts 'trigger fragments': content optimized via black-box embedding API access to guarantee retrieval by the RAG system. This is adversarial document injection for RAG — inputs strategically crafted to manipulate the retrieval system's output, which is explicitly called out in the ML01 dual-tagging guidance.


Details

Domains
nlp
Model Types
llm · transformer
Threat Tags
black_box · inference_time · targeted
Datasets
11 benchmarks (unnamed in abstract)
Applications
rag systems · agentic ai systems · multi-agent workflows · email summarization