Injecting Falsehoods: Adversarial Man-in-the-Middle Attacks Undermining Factual Recall in LLMs
Alina Fastowski , Bardh Prenkaj , Yuxiao Li , Gjergji Kasneci
Published on arXiv
2511.05919
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
Instruction-based prompt injection attacks achieve up to ~85.3% success rate on LLM factual QA, detectable with ~96% AUC using response uncertainty features.
Xmera
Novel technique introduced
LLMs are now an integral part of information retrieval. As such, their role as question answering chatbots raises significant concerns due to their shown vulnerability to adversarial man-in-the-middle (MitM) attacks. Here, we propose the first principled attack evaluation on LLM factual memory under prompt injection via Xmera, our novel, theory-grounded MitM framework. By perturbing the input given to "victim" LLMs in three closed-book and fact-based QA settings, we undermine the correctness of the responses and assess the uncertainty of their generation process. Surprisingly, trivial instruction-based attacks report the highest success rate (up to ~85.3%) while simultaneously having a high uncertainty for incorrectly answered questions. To provide a simple defense mechanism against Xmera, we train Random Forest classifiers on the response uncertainty levels to distinguish between attacked and unattacked queries (average AUC of up to ~96%). We believe that signaling users to be cautious about the answers they receive from black-box and potentially corrupt LLMs is a first checkpoint toward user cyberspace safety.
Key Contributions
- Xmera: a theory-grounded MitM attack framework that evaluates prompt injection attacks on LLM factual memory across three closed-book QA settings
- Empirical finding that trivial instruction-based attacks achieve the highest success rate (~85.3%) with elevated generation uncertainty on incorrect answers
- Uncertainty-based Random Forest defense classifier that distinguishes attacked from unattacked queries with AUC up to ~96%