benchmark 2026

Backdooring Bias in Large Language Models

Anudeep Das , Prach Chantasantitam , Gurjot Singh , Lipeng He , Mariia Ponomarenko , Florian Kerschbaum

0 citations · 24 references · arXiv (Cornell University)

α

Published on arXiv

2602.13427

Model Poisoning

OWASP ML Top 10 — ML10

Key Finding

Semantically-triggered backdoors outperform syntactic ones for negative bias induction, and both model-intrinsic and model-extrinsic defenses fail to mitigate attacks without significant utility loss or compute cost.


Large language models (LLMs) are increasingly deployed in settings where inducing a bias toward a certain topic can have significant consequences, and backdoor attacks can be used to produce such models. Prior work on backdoor attacks has largely focused on a black-box threat model, with an adversary targeting the model builder's LLM. However, in the bias manipulation setting, the model builder themselves could be the adversary, warranting a white-box threat model where the attacker's ability to poison, and manipulate the poisoned data is substantially increased. Furthermore, despite growing research in semantically-triggered backdoors, most studies have limited themselves to syntactically-triggered attacks. Motivated by these limitations, we conduct an analysis consisting of over 1000 evaluations using higher poisoning ratios and greater data augmentation to gain a better understanding of the potential of syntactically- and semantically-triggered backdoor attacks in a white-box setting. In addition, we study whether two representative defense paradigms, model-intrinsic and model-extrinsic backdoor removal, are able to mitigate these attacks. Our analysis reveals numerous new findings. We discover that while both syntactically- and semantically-triggered attacks can effectively induce the target behaviour, and largely preserve utility, semantically-triggered attacks are generally more effective in inducing negative biases, while both backdoor types struggle with causing positive biases. Furthermore, while both defense types are able to mitigate these backdoors, they either result in a substantial drop in utility, or require high computational overhead.


Key Contributions

  • Comprehensive analysis (1000+ evaluations) of syntactic and semantic backdoor attacks for bias induction in LLMs under a white-box threat model where the model builder is the adversary
  • Finding that semantically-triggered backdoors are generally more effective at inducing negative biases, while both trigger types struggle with positive bias induction
  • Evaluation of model-intrinsic and model-extrinsic defenses showing they either cause substantial utility degradation or require high computational overhead

🛡️ Threat Analysis

Model Poisoning

Paper studies trigger-based backdoor attacks (both syntactically- and semantically-triggered) on LLMs that activate hidden biased behavior only when a specific trigger is present — classic backdoor/trojan insertion evaluated with various poisoning ratios, data augmentation strategies, and defenses including neural cleanse-style model-intrinsic and model-extrinsic removal.


Details

Domains
nlp
Model Types
llmtransformer
Threat Tags
white_boxtraining_timetargeted
Applications
large language modelstext generationopinion/sentiment systems