
Published on arXiv

2509.21761

Model Poisoning

OWASP ML Top 10 — ML10

Key Finding

Ablating ~3% of identified attention heads reduces LLM backdoor ASR by over 90%; a single-point intervention on one internal representation can suppress ASR to ~0% on triggered inputs or boost it to ~100% on clean inputs.

BkdAttr / BAHA (Backdoor Attention Head Attribution) / Backdoor Vector

Novel technique introduced


Fine-tuned Large Language Models (LLMs) are vulnerable to backdoor attacks through data poisoning, yet the internal mechanisms governing these attacks remain a black box. Previous interpretability research on LLM safety tends to focus on alignment, jailbreaks, and hallucination, but overlooks backdoor mechanisms, making it difficult to understand and fully eliminate the backdoor threat. In this paper, aiming to bridge this gap, we explore the interpretable mechanisms of LLM backdoors through Backdoor Attribution (BkdAttr), a tripartite causal analysis framework. We first introduce the Backdoor Probe, which proves the existence of learnable backdoor features encoded within the model's representations. Building on this insight, we develop Backdoor Attention Head Attribution (BAHA) to efficiently pinpoint the specific attention heads responsible for processing these features. Our primary experiments reveal that these heads are relatively sparse; ablating a minimal ~3% of total heads is sufficient to reduce the Attack Success Rate (ASR) by over 90%. More importantly, we use these findings to construct the Backdoor Vector, derived from the attributed heads, as a master controller for the backdoor. Through a single 1-point intervention on one representation, the vector can either boost ASR up to ~100% (↑) on clean inputs or completely neutralize the backdoor, suppressing ASR down to ~0% (↓) on triggered inputs. In conclusion, our work pioneers the exploration of mechanistic interpretability in LLM backdoors, demonstrating a powerful method for backdoor control and revealing actionable insights for the community.


Key Contributions

  • BkdAttr — a tripartite causal analysis framework that proves backdoor features are encoded in LLM representations and identifies the specific attention heads (BAHA) responsible for processing them
  • Finding that ablating only ~3% of attention heads is sufficient to reduce backdoor ASR by over 90%, revealing the structural sparsity of backdoor circuitry in LLMs
  • Backdoor Vector derived from attributed heads that acts as a single-point master controller, either neutralizing (ASR → ~0%) or fully activating (ASR → ~100%) the backdoor via a 1-point representation intervention
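The Backdoor Vector idea above can be sketched as a steering-style representation intervention. The difference-of-means construction, function names, and toy data below are illustrative assumptions, not the paper's exact derivation (which attributes the vector to specific attention heads):

```python
import numpy as np

def backdoor_vector(triggered_reps: np.ndarray, clean_reps: np.ndarray) -> np.ndarray:
    """Difference-of-means direction between triggered and clean representations.

    Illustrative construction only; the paper derives its vector from the
    attributed attention heads rather than raw representation means.
    """
    return triggered_reps.mean(axis=0) - clean_reps.mean(axis=0)

def intervene(rep: np.ndarray, vec: np.ndarray, alpha: float) -> np.ndarray:
    """Single-point intervention on one representation: alpha > 0 pushes it
    toward the backdoor direction (boosting ASR), alpha < 0 pushes it away
    (suppressing ASR)."""
    return rep + alpha * vec

# Toy demo: 4-dim representations, backdoor shift along the last axis
rng = np.random.default_rng(0)
clean = rng.normal(size=(32, 4))
triggered = clean + np.array([0.0, 0.0, 0.0, 2.0])  # triggered = clean + shift

vec = backdoor_vector(triggered, clean)              # recovers the shift direction
suppressed = intervene(triggered[0], vec, alpha=-1.0)  # removes the backdoor component
```

With `alpha = -1` the intervention exactly cancels the injected shift in this toy setup, mirroring the paper's suppress-to-~0% direction; `alpha = +1` applied to a clean representation would mimic the boost-to-~100% direction.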

🛡️ Threat Analysis

Model Poisoning

The paper's entire focus is on backdoor mechanisms in LLMs — understanding which attention heads encode trigger-activated behavior and constructing a Backdoor Vector that can suppress ASR to ~0% or amplify it to ~100% via a single representation intervention. This is direct backdoor analysis and defense.
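Head ablation at this scale can be sketched as zeroing the outputs of the attributed heads. The `(num_heads, d_head)` layout and the `ablate_heads` helper are hypothetical, for illustration only:

```python
import numpy as np

def ablate_heads(head_outputs: np.ndarray, head_ids: list) -> np.ndarray:
    """Zero-ablate the attributed attention heads.

    head_outputs: (num_heads, d_head) per-layer head outputs (hypothetical layout).
    head_ids: indices of heads flagged by the attribution step.
    """
    out = head_outputs.copy()  # leave the original tensor untouched
    out[head_ids] = 0.0        # silence the attributed heads
    return out

# Toy example: 32 heads, ablating ~3% of them (one head here)
outputs = np.ones((32, 8))
ablated = ablate_heads(outputs, [5])
```

In the paper's setting, ablating roughly this fraction of heads (~3% of the total) is what reduces ASR by over 90%.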


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
training_time, white_box, targeted
Datasets
jailbreak-style backdoor benchmarks
Applications
large language models, fine-tuned llms, instruction-following models