Backdooring Bias in Large Language Models

Large language models (LLMs) are increasingly deployed in settings where inducing a bias toward a certain topic can have significant consequences, and backdoor attacks can be used to produce such models. Prior work on backdoor attacks has largely focused on a black-box threat model, with an adversary targeting the model builder's LLM. However, in the bias manipulation setting, the model builder themselves could be the adversary, warranting a white-box threat model where the attacker's ability to poison, and manipulate the poisoned data is substantially increased. Furthermore, despite growing research in semantically-triggered backdoors, most studies have limited themselves to syntactically-triggered attacks. Motivated by these limitations, we conduct an analysis consisting of over 1000 evaluations using higher poisoning ratios and greater data augmentation to gain a better understanding of the potential of syntactically- and semantically-triggered backdoor attacks in a white-box setting. In addition, we study whether two representative defense paradigms, model-intrinsic and model-extrinsic backdoor removal, are able to mitigate these attacks. Our analysis reveals numerous new findings. We discover that while both syntactically- and semantically-triggered attacks can effectively induce the target behaviour, and largely preserve utility, semantically-triggered attacks are generally more effective in inducing negative biases, while both backdoor types struggle with causing positive biases. Furthermore, while both defense types are able to mitigate these backdoors, they either result in a substantial drop in utility, or require high computational overhead.

Key Contributions

Comprehensive analysis (1000+ evaluations) of syntactic and semantic backdoor attacks for bias induction in LLMs under a white-box threat model where the model builder is the adversary
Finding that semantically-triggered backdoors are generally more effective at inducing negative biases, while both trigger types struggle with positive bias induction
Evaluation of model-intrinsic and model-extrinsic defenses showing they either cause substantial utility degradation or require high computational overhead

🛡️ Threat Analysis

Model Poisoning

Paper studies trigger-based backdoor attacks (both syntactically- and semantically-triggered) on LLMs that activate hidden biased behavior only when a specific trigger is present — classic backdoor/trojan insertion evaluated with various poisoning ratios, data augmentation strategies, and defenses including neural cleanse-style model-intrinsic and model-extrinsic removal.

Details

Domains

nlp

Model Types

llmtransformer

Threat Tags

white_boxtraining_timetargeted

Applications

2025 0 cit.

Model Poisoning

83%

Backdooring Bias in Large Language Models

Key Contributions

🛡️ Threat Analysis

Details

Similar Papers

Mechanistic Exploration of Backdoored Large Language Model Attention Patterns

A Systematic Evaluation of Parameter-Efficient Fine-Tuning Methods for the Security of Code LLMs

Signature in Code Backdoor Detection, how far are we?

Triggers Hijack Language Circuits: A Mechanistic Analysis of Backdoor Behaviors in Large Language Models

Inverting Trojans in LLMs

SASER: Stego attacks on open-source LLMs

Lethe: Purifying Backdoored Large Language Models with Knowledge Dilution

Trigger Where It Hurts: Unveiling Hidden Backdoors through Sensitivity with Sensitron