attack 2025

Backdoor-Powered Prompt Injection Attacks Nullify Defense Methods

Yulin Chen 1, Haoran Li 2, Yuan Sui 1, Yangqiu Song 2, Bryan Hooi 1

1 citations · 64 references · EMNLP

α

Published on arXiv

2510.03705

Model Poisoning

OWASP ML Top 10 — ML10

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

Backdoor-powered prompt injection via SFT data poisoning achieves higher attack success rates than standard prompt injection while completely nullifying instruction hierarchy defenses that previously blocked such attacks.

Backdoor-Powered Prompt Injection Attack (BPI)

Novel technique introduced


With the development of technology, large language models (LLMs) have dominated the downstream natural language processing (NLP) tasks. However, because of the LLMs' instruction-following abilities and inability to distinguish the instructions in the data content, such as web pages from search engines, the LLMs are vulnerable to prompt injection attacks. These attacks trick the LLMs into deviating from the original input instruction and executing the attackers' target instruction. Recently, various instruction hierarchy defense strategies are proposed to effectively defend against prompt injection attacks via fine-tuning. In this paper, we explore more vicious attacks that nullify the prompt injection defense methods, even the instruction hierarchy: backdoor-powered prompt injection attacks, where the attackers utilize the backdoor attack for prompt injection attack purposes. Specifically, the attackers poison the supervised fine-tuning samples and insert the backdoor into the model. Once the trigger is activated, the backdoored model executes the injected instruction surrounded by the trigger. We construct a benchmark for comprehensive evaluation. Our experiments demonstrate that backdoor-powered prompt injection attacks are more harmful than previous prompt injection attacks, nullifying existing prompt injection defense methods, even the instruction hierarchy techniques.


Key Contributions

  • Introduces backdoor-powered prompt injection attacks that poison SFT data to embed a trigger-activated backdoor, causing the model to execute attacker-injected instructions from external content
  • Demonstrates that this attack nullifies existing instruction hierarchy defenses that previously mitigated standard prompt injection
  • Constructs a benchmark for comprehensive evaluation of prompt injection attacks and defenses

🛡️ Threat Analysis

Model Poisoning

The core contribution is inserting a backdoor into LLMs by poisoning supervised fine-tuning (SFT) samples; when a trigger is activated, the model executes the attacker-controlled injected instruction — a classic backdoor/trojan with targeted trigger-conditioned behavior.


Details

Domains
nlp
Model Types
llmtransformer
Threat Tags
training_timetargeteddigital
Datasets
custom constructed benchmark
Applications
llm-integrated applicationsretrieval-augmented generationweb search with llms