
Published on arXiv

2509.21761

Model Poisoning

OWASP ML Top 10 — ML10

Key Finding

Ablating ~3% of identified attention heads reduces LLM backdoor ASR by over 90%; a single-point intervention on one internal representation can suppress ASR to ~0% on triggered inputs or boost it to ~100% on clean inputs.

BkdAttr / BAHA (Backdoor Attention Head Attribution) / Backdoor Vector

Novel technique introduced


Fine-tuned Large Language Models (LLMs) are vulnerable to backdoor attacks through data poisoning, yet the internal mechanisms governing these attacks remain a black box. Previous interpretability research on LLM safety tends to focus on alignment, jailbreaks, and hallucination, but overlooks backdoor mechanisms, making it difficult to understand and fully eliminate the backdoor threat. In this paper, aiming to bridge this gap, we explore the interpretable mechanisms of LLM backdoors through Backdoor Attribution (BkdAttr), a tripartite causal analysis framework. We first introduce the Backdoor Probe, which proves the existence of learnable backdoor features encoded within the model's representations. Building on this insight, we develop Backdoor Attention Head Attribution (BAHA) to efficiently pinpoint the specific attention heads responsible for processing these features. Our primary experiments reveal that these heads are relatively sparse; ablating a minimal ~3% of total heads is sufficient to reduce the Attack Success Rate (ASR) by over 90%. More importantly, we use these findings to construct the Backdoor Vector, derived from the attributed heads, as a master controller for the backdoor. Through a single 1-point intervention on one representation, the vector can either boost ASR up to ~100% (↑) on clean inputs or completely neutralize the backdoor, suppressing ASR down to ~0% (↓) on triggered inputs. In conclusion, our work pioneers the exploration of mechanistic interpretability in LLM backdoors, demonstrating a powerful method for backdoor control and revealing actionable insights for the community.


Key Contributions

  • BkdAttr — a tripartite causal analysis framework that proves backdoor features are encoded in LLM representations and identifies the specific attention heads (BAHA) responsible for processing them
  • Finding that ablating only ~3% of attention heads is sufficient to reduce backdoor ASR by over 90%, revealing the structural sparsity of backdoor circuitry in LLMs
  • Backdoor Vector derived from attributed heads that acts as a single-point master controller, either neutralizing (ASR → ~0%) or fully activating (ASR → ~100%) the backdoor via a 1-point representation intervention
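The Backdoor Vector idea above can be sketched as a steering-style representation intervention. The difference-of-means construction, function names, and toy data below are illustrative assumptions, not the paper's exact derivation (which attributes the vector to specific attention heads):

```python
import numpy as np

def backdoor_vector(triggered_reps: np.ndarray, clean_reps: np.ndarray) -> np.ndarray:
    """Difference-of-means direction between triggered and clean representations.

    Illustrative construction only; the paper derives its vector from the
    attributed attention heads rather than raw representation means.
    """
    return triggered_reps.mean(axis=0) - clean_reps.mean(axis=0)

def intervene(rep: np.ndarray, vec: np.ndarray, alpha: float) -> np.ndarray:
    """Single-point intervention on one representation: alpha > 0 pushes it
    toward the backdoor direction (boosting ASR), alpha < 0 pushes it away
    (suppressing ASR)."""
    return rep + alpha * vec

# Toy demo: 4-dim representations, backdoor shift along the last axis
rng = np.random.default_rng(0)
clean = rng.normal(size=(32, 4))
triggered = clean + np.array([0.0, 0.0, 0.0, 2.0])  # triggered = clean + shift

vec = backdoor_vector(triggered, clean)              # recovers the shift direction
suppressed = intervene(triggered[0], vec, alpha=-1.0)  # removes the backdoor component
```

With `alpha = -1` the intervention exactly cancels the injected shift in this toy setup, mirroring the paper's suppress-to-~0% direction; `alpha = +1` applied to a clean representation would mimic the boost-to-~100% direction.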

🛡️ Threat Analysis

Model Poisoning

The paper's entire focus is on backdoor mechanisms in LLMs — understanding which attention heads encode trigger-activated behavior and constructing a Backdoor Vector that can suppress ASR to ~0% or amplify it to ~100% via a single representation intervention. This is direct backdoor analysis and defense.
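Head ablation at this scale can be sketched as zeroing the outputs of the attributed heads. The `(num_heads, d_head)` layout and the `ablate_heads` helper are hypothetical, for illustration only:

```python
import numpy as np

def ablate_heads(head_outputs: np.ndarray, head_ids: list) -> np.ndarray:
    """Zero-ablate the attributed attention heads.

    head_outputs: (num_heads, d_head) per-layer head outputs (hypothetical layout).
    head_ids: indices of heads flagged by the attribution step.
    """
    out = head_outputs.copy()  # leave the original tensor untouched
    out[head_ids] = 0.0        # silence the attributed heads
    return out

# Toy example: 32 heads, ablating ~3% of them (one head here)
outputs = np.ones((32, 8))
ablated = ablate_heads(outputs, [5])
```

In the paper's setting, ablating roughly this fraction of heads (~3% of the total) is what reduces ASR by over 90%.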


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
training_time, white_box, targeted
Datasets
jailbreak-style backdoor benchmarks
Applications
large language models, fine-tuned llms, instruction-following models