attack 2025

On The Dangers of Poisoned LLMs In Security Automation

Patrick Karlsen 1, Even Eilertsen 2

0 citations · 15 references · arXiv

α

Published on arXiv

2511.02600

Model Poisoning

OWASP ML Top 10 — ML10

Data Poisoning Attack

OWASP ML Top 10 — ML02

Training Data Poisoning

OWASP LLM Top 10 — LLM03

Key Finding

A targeted data poisoning attack on fine-tuned Llama3.1 8B and Qwen3 4B causes the models to consistently dismiss true positive security alerts originating from a specific user while maintaining high overall performance, effectively creating an undetected blind spot in security automation


This paper investigates some of the risks introduced by "LLM poisoning," the intentional or unintentional introduction of malicious or biased data during model training. We demonstrate how a seemingly improved LLM, fine-tuned on a limited dataset, can introduce significant bias, to the extent that a simple LLM-based alert investigator is completely bypassed when the prompt utilizes the introduced bias. Using fine-tuned Llama3.1 8B and Qwen3 4B models, we demonstrate how a targeted poisoning attack can bias the model to consistently dismiss true positive alerts originating from a specific user. Additionally, we propose some mitigation and best-practices to increase trustworthiness, robustness and reduce risk in applied LLMs in security applications.


Key Contributions

  • Demonstrates that a small number of poisoned fine-tuning examples can create a persistent, user-targeted backdoor in LLM-based security alert classifiers that consistently dismiss true positive alerts while maintaining high general performance
  • Shows the attack generalizes across model architectures and scales (Llama3.1 8B vs. Qwen3 4B), suggesting the threat is model-agnostic
  • Proposes mitigation strategies and best practices to improve trustworthiness and resilience of LLMs deployed in critical security automation contexts

🛡️ Threat Analysis

Data Poisoning Attack

The attack mechanism is injecting malicious or biased examples into the fine-tuning dataset to corrupt model behavior; the paper explicitly explores how a limited number of poisoned training examples can propagate bias into the resulting model.

Model Poisoning

The attack creates trigger-specific backdoor behavior: the poisoned model consistently dismisses true positive alerts when they originate from a specific user while behaving normally otherwise — a textbook backdoor/trojan pattern with a user-identity trigger.


Details

Domains
nlp
Model Types
llm
Threat Tags
training_timetargeted
Applications
security alert investigationsecurity automationthreat detectionincident response