Bypassing Prompt Injection Detectors through Evasive Injections
Md Jahedur Rahman, Ihsen Alouani
Published on arXiv
2602.00750
Input Manipulation Attack
OWASP ML Top 10 — ML01
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
A single universal adversarial suffix achieves up to 99.63% evasion success against activation-delta-based prompt injection detectors on Llama-3 8B under majority-vote criteria.
GCG universal adversarial suffix for detector evasion
Novel technique introduced
Large language models (LLMs) are increasingly used in interactive and retrieval-augmented systems, but they remain vulnerable to task drift: deviations from a user's intended instruction caused by injected secondary prompts. Recent work has shown that linear probes trained on the activation deltas of an LLM's hidden layers can effectively detect such drift. In this paper, we evaluate the robustness of these detectors against adversarially optimised suffixes. We generate universal suffixes that cause poisoned inputs to evade detection across multiple probes simultaneously. Our experiments on Phi-3 3.8B and Llama-3 8B show that a single suffix achieves high attack success rates, up to 93.91% and 99.63% respectively when all probes must be fooled, and over 90% success under a majority-vote setting. These results demonstrate that activation-delta-based task drift detectors are highly vulnerable to adversarial suffixes, highlighting the need for stronger defences against adaptive attacks. We also propose a defence: we generate multiple adversarial suffixes, randomly append one of them to each prompt during the LLM's forward passes, and train logistic regression probes on the resulting activations. We found this approach to be highly effective against such attacks.
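The detector the abstract describes can be sketched in miniature. The following is a toy illustration, not the paper's implementation: `hidden_state` is a hypothetical stand-in for an LLM hidden-layer activation (a seeded random vector with a small systematic shift on injected inputs), and the probe is a plain logistic regression fitted by gradient descent on activation deltas.

```python
import numpy as np

rng = np.random.default_rng(0)

def hidden_state(text, drifted=False):
    """Hypothetical stand-in for an LLM hidden-layer activation.

    In the paper's setting this would be a transformer layer's activation
    on the given text; here it is a seeded random vector so the sketch
    runs without model weights. Injected (drifted) inputs receive a small
    systematic shift, mimicking the signal the linear probes exploit.
    """
    base = rng.normal(size=16)
    return base + (0.8 if drifted else 0.0)

def activation_delta(primary_text, full_text, drifted):
    # Detector feature: activation(primary task + external data)
    # minus activation(primary task alone).
    return hidden_state(full_text, drifted) - hidden_state(primary_text)

# Small training set of deltas: label 1 = injected, 0 = clean.
X = np.array([activation_delta("task", "task + data", drifted=(i % 2 == 1))
              for i in range(200)])
y = np.array([i % 2 for i in range(200)])

# Logistic-regression probe trained by plain gradient descent.
w, b = np.zeros(X.shape[1]), 0.0
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    w -= 0.5 * (X.T @ (p - y)) / len(y)
    b -= 0.5 * float(np.mean(p - y))

preds = (1.0 / (1.0 + np.exp(-(X @ w + b))) > 0.5).astype(int)
accuracy = float(np.mean(preds == y))
print(f"probe training accuracy: {accuracy:.2f}")
```

Because the drift shift is a consistent direction in activation space, even this linear probe separates the classes well, which is exactly the structure a GCG-optimised suffix then learns to cancel.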
Key Contributions
- First demonstration that activation-delta-based task drift detectors (linear probes on LLM hidden layers) are highly vulnerable to universal adversarial suffixes generated via GCG
- Universal suffix generation strategy that simultaneously fools multiple linear probes across all layers, achieving up to 99.63% attack success rate on Llama-3 8B
- Defense via randomized adversarial suffix augmentation during detector training, shown to be highly effective against the proposed attack
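The proposed defence (third bullet) can be illustrated with the same toy activation model. This is a hedged sketch under stated assumptions: each adversarial suffix is modelled only by the activation shift it induces (in the paper these are GCG-found token strings), and during training a random suffix from the pool is appended to some prompts so the probe learns to see past suffix-induced shifts.

```python
import numpy as np

rng = np.random.default_rng(1)
DIM = 16
# Hypothetical pool of adversarial suffixes, each modelled as the
# activation shift it induces when appended to a prompt.
suffix_shifts = rng.normal(size=(4, DIM))

def delta_features(injected, suffix_id=None):
    """Toy activation delta: injected inputs carry a drift signal,
    and an appended suffix adds its characteristic shift."""
    x = rng.normal(size=DIM) + (1.2 if injected else 0.0)
    if suffix_id is not None:
        x = x + suffix_shifts[suffix_id]
    return x

# Defence: randomly append one of the generated suffixes while building
# training activations, then fit a logistic-regression probe on them.
X, y = [], []
for i in range(400):
    injected = i % 2
    sid = int(rng.integers(0, len(suffix_shifts))) if rng.random() < 0.5 else None
    X.append(delta_features(bool(injected), sid))
    y.append(injected)
X, y = np.array(X), np.array(y)

w, b = np.zeros(DIM), 0.0
for _ in range(800):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    w -= 0.5 * (X.T @ (p - y)) / len(y)
    b -= 0.5 * float(np.mean(p - y))

# Evaluate on injected inputs that always carry a suffix (the attack setting).
X_atk = np.array([delta_features(True, int(rng.integers(0, 4))) for _ in range(200)])
caught = float(np.mean(1.0 / (1.0 + np.exp(-(X_atk @ w + b))) > 0.5))
print(f"suffix-bearing injections still detected: {caught:.2f}")
```

The design intuition is that exposing the probe to suffix-induced shifts on both classes during training turns those shifts into label-irrelevant noise, so a fixed suffix no longer reliably moves injected inputs across the decision boundary.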
🛡️ Threat Analysis
The core attack uses Greedy Coordinate Gradient (GCG) — a gradient-based, token-level optimization — to generate universal adversarial suffixes that manipulate the LLM's hidden-layer activations so poisoned inputs evade the linear probe classifiers. This is a canonical adversarial evasion attack at inference time, relying on gradient-based perturbation rather than natural-language manipulation.
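The GCG loop can be sketched on a toy surrogate. Assumptions are labelled in the comments: `E` and `w_probe` are hypothetical stand-ins for token embeddings and a linear probe (real GCG backpropagates through the LLM itself), and the step follows GCG's shape — use the gradient with respect to one-hot token choices to shortlist candidate swaps, then keep the swap that actually lowers the detector's score.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, DIM, SUFFIX_LEN = 50, 8, 5

# Toy stand-ins: token embeddings and a linear "probe" the attacker wants
# to push below threshold. Real GCG differentiates through the LLM instead.
E = rng.normal(size=(VOCAB, DIM))
w_probe = rng.normal(size=DIM)

def probe_score(suffix_ids):
    """Detector score on the suffix's mean embedding (toy surrogate)."""
    feats = E[suffix_ids].mean(axis=0)
    return 1.0 / (1.0 + np.exp(-(feats @ w_probe)))

def gcg_step(suffix_ids, top_k=8):
    """One greedy-coordinate-gradient step: shortlist tokens by the
    gradient w.r.t. the one-hot choices, then greedily accept the single
    substitution that most lowers the probe score."""
    # In this linear toy model the gradient w.r.t. each position's one-hot
    # is proportional to E @ w_probe, the same for every position.
    grad = E @ w_probe                      # shape (VOCAB,)
    candidates = np.argsort(grad)[:top_k]   # most score-lowering tokens
    best, best_score = list(suffix_ids), probe_score(suffix_ids)
    for pos in range(len(suffix_ids)):
        for tok in candidates:
            trial = list(suffix_ids)
            trial[pos] = int(tok)
            s = probe_score(trial)
            if s < best_score:
                best, best_score = trial, s
    return best, best_score

suffix = [int(t) for t in rng.integers(0, VOCAB, size=SUFFIX_LEN)]
start = probe_score(suffix)
for _ in range(5):
    suffix, score = gcg_step(suffix)
print(f"probe score: {start:.3f} -> {score:.3f}")
```

Each step only ever accepts score-lowering swaps, so the detector score decreases monotonically; training such a suffix against several probes' losses at once is what makes it "universal" across layers in the paper's setting.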