On The Dangers of Poisoned LLMs In Security Automation

This paper investigates some of the risks introduced by "LLM poisoning," the intentional or unintentional introduction of malicious or biased data during model training. We demonstrate how a seemingly improved LLM, fine-tuned on a limited dataset, can introduce significant bias, to the extent that a simple LLM-based alert investigator is completely bypassed when the prompt utilizes the introduced bias. Using fine-tuned Llama3.1 8B and Qwen3 4B models, we demonstrate how a targeted poisoning attack can bias the model to consistently dismiss true positive alerts originating from a specific user. Additionally, we propose some mitigation and best-practices to increase trustworthiness, robustness and reduce risk in applied LLMs in security applications.

Key Contributions

Demonstrates that a small number of poisoned fine-tuning examples can create a persistent, user-targeted backdoor in LLM-based security alert classifiers that consistently dismiss true positive alerts while maintaining high general performance
Shows the attack generalizes across model architectures and scales (Llama3.1 8B vs. Qwen3 4B), suggesting the threat is model-agnostic
Proposes mitigation strategies and best practices to improve trustworthiness and resilience of LLMs deployed in critical security automation contexts

🛡️ Threat Analysis

Data Poisoning Attack

The attack mechanism is injecting malicious or biased examples into the fine-tuning dataset to corrupt model behavior; the paper explicitly explores how a limited number of poisoned training examples can propagate bias into the resulting model.

Model Poisoning

The attack creates trigger-specific backdoor behavior: the poisoned model consistently dismisses true positive alerts when they originate from a specific user while behaving normally otherwise — a textbook backdoor/trojan pattern with a user-identity trigger.

Details

Domains

nlp

Model Types

llm

Threat Tags

training_timetargeted

Applications

Key Contributions

🛡️ Threat Analysis

Details

Similar Papers

Weird Generalization and Inductive Backdoors: New Ways to Corrupt LLMs

The 'Sure' Trap: Multi-Scale Poisoning Analysis of Stealthy Compliance-Only Backdoors in Fine-Tuned Large Language Models

Poisoning Attacks on LLMs Require a Near-constant Number of Poison Samples

SteganoBackdoor: Stealthy and Data-Efficient Backdoor Attacks on Language Models

Virus Infection Attack on LLMs: Your Poisoning Can Spread "VIA" Synthetic Data

Cost-Minimized Label-Flipping Poisoning Attack to LLM Alignment

Hail to the Thief: Exploring Attacks and Defenses in Decentralised GRPO

Poison Once, Refuse Forever: Weaponizing Alignment for Injecting Bias in LLMs