Poison Once, Refuse Forever: Weaponizing Alignment for Injecting Bias in LLMs

Large Language Models (LLMs) are aligned to meet ethical standards and safety requirements by training them to refuse answering harmful or unsafe prompts. In this paper, we demonstrate how adversaries can exploit LLMs' alignment to implant bias, or enforce targeted censorship without degrading the model's responsiveness to unrelated topics. Specifically, we propose Subversive Alignment Injection (SAI), a poisoning attack that leverages the alignment mechanism to trigger refusal on specific topics or queries predefined by the adversary. Although it is perhaps not surprising that refusal can be induced through overalignment, we demonstrate how this refusal can be exploited to inject bias into the model. Surprisingly, SAI evades state-of-the-art poisoning defenses including LLM state forensics, as well as robust aggregation techniques that are designed to detect poisoning in FL settings. We demonstrate the practical dangers of this attack by illustrating its end-to-end impacts on LLM-powered application pipelines. For chat based applications such as ChatDoctor, with 1% data poisoning, the system refuses to answer healthcare questions to targeted racial category leading to high bias ($ΔDP$ of 23%). We also show that bias can be induced in other NLP tasks: for a resume selection pipeline aligned to refuse to summarize CVs from a selected university, high bias in selection ($ΔDP$ of 27%) results. Even higher bias ($ΔDP$~38%) results on 9 other chat based downstream applications.

Key Contributions

Subversive Alignment Injection (SAI): a poisoning attack that exploits LLM alignment mechanisms to trigger targeted refusal on adversary-specified topics or demographic groups without degrading general model utility.
Demonstrates evasion of state-of-the-art poisoning defenses including LLM state forensics and robust FL aggregation techniques (e.g., Byzantine-fault-tolerant protocols).
End-to-end evaluation showing ΔDP bias of 23–38% across healthcare chatbots, resume selection pipelines, and nine other downstream LLM applications using only 1% poisoned training data.

🛡️ Threat Analysis

Data Poisoning Attack

The primary attack vector is training data poisoning (1% injection rate), and the paper explicitly evaluates evasion of robust FL aggregation techniques designed to detect Byzantine poisoning, placing it squarely in the data poisoning threat model.

Model Poisoning

SAI embeds hidden, targeted behavior — topic/demographic-specific refusal — that activates only for predefined queries while the model behaves normally on unrelated topics. This is functionally a backdoor/trojan inserted via alignment exploitation, fitting the ML10 definition precisely.