Uncovering and Aligning Anomalous Attention Heads to Defend Against NLP Backdoor Attacks
Haotian Jin 1,2,3, Yang Li 1,2, Haihui Fan 1,2, Lin Shen 1,2,3, Xiangfang Li 1,2,3, Bo Li 1,2
Published on arXiv: 2511.13789
Model Poisoning
OWASP ML Top 10 — ML10
Key Finding
Significantly reduces backdoor attack success rate across diverse trigger types while preserving model performance on clean downstream tasks, without requiring prior knowledge of trigger form or a reference clean model.
Attention Safety Alignment
Novel technique introduced
Backdoor attacks pose a serious threat to the security of large language models (LLMs), causing them to exhibit anomalous behavior under specific trigger conditions. Trigger design has evolved from fixed triggers to dynamic or implicit ones, and this increased flexibility makes it difficult for defenders to accurately identify a trigger's specific form. Most existing backdoor defenses are limited to specific trigger types or rely on an additional clean model for support. To address these limitations, we propose a backdoor detection method based on attention similarity that requires no prior knowledge of the trigger. Our study reveals that backdoored models exhibit unusually high similarity among attention heads when exposed to triggers. Building on this observation, we propose an attention safety alignment approach, combined with head-wise fine-tuning, that rectifies potentially contaminated attention heads and thereby effectively mitigates backdoor attacks. Extensive experiments demonstrate that our method significantly reduces the attack success rate while preserving the model's performance on downstream tasks.
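The detection signal described in the abstract, unusually high similarity among attention heads under trigger inputs, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the cosine metric over flattened attention maps, the 0.95 cutoff, and the per-layer array layout are all assumptions for the example.

```python
import numpy as np

def pairwise_head_similarity(attn):
    """Cosine similarity between flattened per-head attention maps.

    attn: array of shape (num_heads, seq_len, seq_len) holding one
    layer's attention maps for a single input.
    Returns a (num_heads, num_heads) similarity matrix."""
    flat = attn.reshape(attn.shape[0], -1).astype(float)
    flat /= np.linalg.norm(flat, axis=1, keepdims=True) + 1e-12
    return flat @ flat.T

def flag_suspicious_heads(sim, thresh=0.95):
    """Flag heads that are near-duplicates of at least one other head.
    The 0.95 cutoff is an illustrative hyperparameter, not a value
    taken from the paper."""
    off_diag = sim - np.eye(sim.shape[0])  # zero out self-similarity
    return np.where(off_diag.max(axis=1) > thresh)[0]

# Toy example: heads 0 and 1 produce identical (trigger-like) maps,
# while the remaining heads attend to distinct positions.
attn = np.zeros((6, 6, 6))
attn[0, :, 0] = 1.0
attn[1, :, 0] = 1.0
for h in range(2, 6):
    attn[h, :, h] = 1.0

sim = pairwise_head_similarity(attn)
flagged = flag_suspicious_heads(sim)  # → array([0, 1])
```

In practice such similarity scores would be collected over many inputs; the point here is only that near-duplicate attention patterns across heads are easy to surface once the maps are compared pairwise.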
Key Contributions
- Discovery that backdoored LLMs exhibit abnormally high attention similarity across certain attention heads specifically when exposed to trigger inputs — a novel diagnostic signal for backdoor detection.
- Attention head safety evaluation method that jointly considers head importance and inter-head similarity to classify heads as suspicious or safe without prior knowledge of trigger type.
- Backdoor sanitization pipeline combining attention safety alignment of suspicious heads toward safe heads and head-wise fine-tuning, effective against word-level, sentence-level, style-based, and syntax-based triggers.
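The alignment step in the sanitization pipeline can be sketched as pulling a suspicious head's parameters toward the safe heads. This is a simplified stand-in, assuming a flattened per-head weight layout and a single mixing coefficient `alpha`; the paper's actual procedure additionally involves head-wise fine-tuning, which is omitted here.

```python
import numpy as np

def align_suspicious_heads(head_weights, suspicious, alpha=0.5):
    """Interpolate suspicious heads toward the mean of the safe heads.

    head_weights: (num_heads, d) flattened per-head parameters
    suspicious:   indices of heads flagged as contaminated
    alpha:        mixing coefficient (illustrative, not from the paper)
    """
    num_heads = head_weights.shape[0]
    safe = [h for h in range(num_heads) if h not in set(suspicious)]
    target = head_weights[safe].mean(axis=0)
    aligned = head_weights.copy()
    for h in suspicious:
        aligned[h] = (1 - alpha) * aligned[h] + alpha * target
    return aligned

# With alpha=1.0 a suspicious head is replaced outright by the safe mean.
w = np.array([[9.0, 9.0],
              [1.0, 1.0],
              [3.0, 3.0]])
aligned = align_suspicious_heads(w, suspicious=[0], alpha=1.0)
```

A softer `alpha` would nudge rather than overwrite the suspicious head, leaving subsequent head-wise fine-tuning to recover clean-task behavior.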
🛡️ Threat Analysis
The paper directly addresses backdoor/trojan attacks in NLP models: the core threat is hidden malicious behavior that activates only on specific trigger inputs. The proposed defense detects and removes backdoors by identifying attention heads with anomalous similarity patterns, then sanitizing them through safety alignment and head-wise fine-tuning, covering fixed, dynamic, and implicit trigger types.