Backdoor Attribution: Elucidating and Controlling Backdoor in Language Models
Miao Yu 1, Zhenhong Zhou 2, Moayad Aloqaily 3, Kun Wang 2, Biwei Huang 4, Stephen Wang 5, Yueming Jin 6, Qingsong Wen 7
1 University of Science and Technology of China
2 Nanyang Technological University
3 United Arab Emirates University
4 University of California San Diego
5 Abel AI
Published on arXiv: 2509.21761
Model Poisoning
OWASP ML Top 10 — ML10
Key Finding
Ablating ~3% of identified attention heads reduces LLM backdoor ASR by over 90%; a single 1-point intervention on one representation can suppress ASR to ~0% or boost it to ~100%.
BkdAttr / BAHA (Backdoor Attention Head Attribution) / Backdoor Vector
Novel technique introduced
Fine-tuned Large Language Models (LLMs) are vulnerable to backdoor attacks through data poisoning, yet the internal mechanisms governing these attacks remain a black box. Previous interpretability research on LLM safety has focused on alignment, jailbreaks, and hallucination, overlooking backdoor mechanisms and making it difficult to understand and fully eliminate the backdoor threat. To bridge this gap, this paper explores the interpretable mechanisms of LLM backdoors through Backdoor Attribution (BkdAttr), a tripartite causal analysis framework. We first introduce the Backdoor Probe, which proves the existence of learnable backdoor features encoded within the model's representations. Building on this insight, we develop Backdoor Attention Head Attribution (BAHA), which efficiently pinpoints the specific attention heads responsible for processing these features. Our primary experiments reveal that these heads are relatively sparse: ablating a minimal ~3% of total heads is sufficient to reduce the Attack Success Rate (ASR) by over 90%. More importantly, we use these findings to construct the Backdoor Vector, derived from the attributed heads, as a master controller for the backdoor. Through a single 1-point intervention on one representation, the vector can either boost ASR up to ~100% (↑) on clean inputs or completely neutralize the backdoor, suppressing ASR down to ~0% (↓) on triggered inputs. In conclusion, our work pioneers the exploration of mechanistic interpretability in LLM backdoors, demonstrating a powerful method for backdoor control and revealing actionable insights for the community.
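The Backdoor Probe step can be illustrated with a minimal sketch: train a linear classifier on hidden representations to test whether triggered and clean inputs are linearly separable. Everything below is an assumption for illustration — the representations are synthetic, the hidden dimension is arbitrary, and the paper's actual probing setup and model activations are not reproduced here.

```python
# Minimal sketch of a linear "backdoor probe" (illustrative, not the paper's code).
# Hidden states are simulated: triggered inputs carry a shift along one direction,
# standing in for a learned backdoor feature in a poisoned model's representations.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d = 64  # hidden dimension (illustrative)
backdoor_dir = rng.normal(size=d)
backdoor_dir /= np.linalg.norm(backdoor_dir)

clean = rng.normal(size=(200, d))                      # clean-input activations
triggered = rng.normal(size=(200, d)) + 3.0 * backdoor_dir  # trigger shifts activations

X = np.vstack([clean, triggered])
y = np.array([0] * 200 + [1] * 200)  # 0 = clean, 1 = triggered

# A simple linear probe: high accuracy means the backdoor feature
# is linearly decodable from the representation.
probe = LogisticRegression(max_iter=1000).fit(X, y)
print(f"probe accuracy: {probe.score(X, y):.2f}")
```

In the paper's setting the probe would be fit on real hidden states from poisoned vs. clean prompts; here the synthetic shift simply demonstrates why a linear probe can expose such a feature.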
Key Contributions
- BkdAttr — a tripartite causal analysis framework that proves backdoor features are encoded in LLM representations and identifies the specific attention heads (BAHA) responsible for processing them
- Finding that ablating only ~3% of attention heads is sufficient to reduce backdoor ASR by over 90%, revealing the structural sparsity of backdoor circuitry in LLMs
- Backdoor Vector derived from attributed heads that acts as a single-point master controller, either neutralizing (ASR → ~0%) or fully activating (ASR → ~100%) the backdoor via a 1-point representation intervention
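The Backdoor Vector contribution resembles activation steering: derive a direction from activations at the attributed heads, then shift a single representation along it to activate or suppress the behavior. The sketch below is a hedged illustration under assumed synthetic activations — the difference-of-means recipe and the `intervene` helper are standard steering constructions, not the paper's exact method.

```python
# Illustrative sketch of a "backdoor vector" single-point intervention.
# Activations are synthetic; the vector is the difference of mean activations
# between triggered and clean inputs (a common activation-steering recipe).
import numpy as np

rng = np.random.default_rng(1)
d = 64
true_dir = rng.normal(size=d)
true_dir /= np.linalg.norm(true_dir)

clean_acts = rng.normal(size=(100, d))
trig_acts = rng.normal(size=(100, d)) + 3.0 * true_dir

# Backdoor vector: mean triggered activation minus mean clean activation.
backdoor_vector = trig_acts.mean(axis=0) - clean_acts.mean(axis=0)

def intervene(h, v, alpha):
    """Single-point intervention: shift representation h along vector v."""
    return h + alpha * v

# Boost: push a clean input's representation toward the triggered regime.
h_boost = intervene(clean_acts[0], backdoor_vector, +1.0)
# Suppress: pull a triggered input's representation back toward clean.
h_suppress = intervene(trig_acts[0], backdoor_vector, -1.0)

# Projection onto the steering direction shows the shift in each case.
u = backdoor_vector / np.linalg.norm(backdoor_vector)
print(clean_acts[0] @ u, h_boost @ u)   # boost raises the projection
print(trig_acts[0] @ u, h_suppress @ u) # suppression lowers it
```

The one-line `intervene` call mirrors the paper's claim that a 1-point edit on a single representation suffices: the edit is applied once, at one location, rather than throughout the forward pass.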
🛡️ Threat Analysis
The paper's entire focus is on backdoor mechanisms in LLMs — understanding which attention heads encode trigger-activated behavior and constructing a Backdoor Vector that can suppress ASR to ~0% or amplify it to ~100% via a single representation intervention. This is direct backdoor analysis and defense.