On The Dangers of Poisoned LLMs In Security Automation
Patrick Karlsen 1, Even Eilertsen 2
Published on arXiv
2511.02600
Model Poisoning
OWASP ML Top 10 — ML10
Data Poisoning Attack
OWASP ML Top 10 — ML02
Training Data Poisoning
OWASP LLM Top 10 — LLM03
Key Finding
A targeted data poisoning attack on fine-tuned Llama3.1 8B and Qwen3 4B causes the models to consistently dismiss true positive security alerts originating from a specific user while maintaining high overall performance, effectively creating an undetected blind spot in security automation
This paper investigates some of the risks introduced by "LLM poisoning," the intentional or unintentional introduction of malicious or biased data during model training. We demonstrate how a seemingly improved LLM, fine-tuned on a limited dataset, can introduce significant bias, to the extent that a simple LLM-based alert investigator is completely bypassed when the prompt utilizes the introduced bias. Using fine-tuned Llama3.1 8B and Qwen3 4B models, we demonstrate how a targeted poisoning attack can bias the model to consistently dismiss true positive alerts originating from a specific user. Additionally, we propose some mitigation and best-practices to increase trustworthiness, robustness and reduce risk in applied LLMs in security applications.
Key Contributions
- Demonstrates that a small number of poisoned fine-tuning examples can create a persistent, user-targeted backdoor in LLM-based security alert classifiers that consistently dismiss true positive alerts while maintaining high general performance
- Shows the attack generalizes across model architectures and scales (Llama3.1 8B vs. Qwen3 4B), suggesting the threat is model-agnostic
- Proposes mitigation strategies and best practices to improve trustworthiness and resilience of LLMs deployed in critical security automation contexts
🛡️ Threat Analysis
The attack mechanism is injecting malicious or biased examples into the fine-tuning dataset to corrupt model behavior; the paper explicitly explores how a limited number of poisoned training examples can propagate bias into the resulting model.
The attack creates trigger-specific backdoor behavior: the poisoned model consistently dismisses true positive alerts when they originate from a specific user while behaving normally otherwise — a textbook backdoor/trojan pattern with a user-identity trigger.