Unmasking Backdoors: An Explainable Defense via Gradient-Attention Anomaly Scoring for Pre-trained Language Models
Anindya Sundar Das 1, Kangjie Chen 2, Monowar Bhuyan 1
Published on arXiv
2510.04347
Model Poisoning
OWASP ML Top 10 — ML10
Key Finding
The proposed inference-time defense significantly reduces backdoor attack success rates across diverse attack scenarios compared to existing baselines while providing trigger localization interpretability.
Gradient-Attention Anomaly Scoring
Novel technique introduced
Pre-trained language models have achieved remarkable success across a wide range of natural language processing (NLP) tasks, particularly when fine-tuned on large, domain-relevant datasets. However, they remain vulnerable to backdoor attacks, in which adversaries embed malicious behaviors via trigger patterns planted in the training data. These triggers remain dormant during normal usage but, when activated, can cause targeted misclassifications. In this work, we investigate the internal behavior of backdoored pre-trained encoder-based language models, focusing on a consistent shift in attention and gradient attribution when processing poisoned inputs: the trigger token dominates both attention and gradient signals, overriding the surrounding context. We propose an inference-time defense that constructs anomaly scores by combining token-level attention and gradient information. Extensive experiments on text classification tasks across diverse backdoor attack scenarios demonstrate that our method significantly reduces attack success rates compared to existing baselines. Furthermore, we provide an interpretability-driven analysis of the scoring mechanism, shedding light on trigger localization and the robustness of the proposed defense.
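The core idea — fusing token-level attention mass with gradient-attribution magnitudes into a per-token anomaly score — can be sketched as follows. This is a minimal illustration on synthetic values, not the authors' exact formulation: the convex fusion weight `alpha`, the normalization choices, and the z-score flagging rule are all assumptions.

```python
import numpy as np

def anomaly_scores(attention, gradients, alpha=0.5):
    """Fuse normalized attention weights and gradient-attribution
    magnitudes into per-token anomaly scores.
    (Hypothetical fusion rule; the paper's exact combination may differ.)"""
    attn = attention / attention.sum()        # normalize attention mass
    grad = np.abs(gradients)
    grad = grad / grad.sum()                  # normalize attribution magnitudes
    return alpha * attn + (1 - alpha) * grad  # convex combination

def flag_outliers(scores, k=1.5):
    """Flag tokens whose score deviates strongly from the rest
    (z-score threshold k is an assumed detection rule)."""
    return scores > scores.mean() + k * scores.std()

# Synthetic example: token "cf" dominates both signals, mimicking the
# attention/gradient drift toward a rare-token backdoor trigger.
tokens = ["the", "movie", "cf", "was", "great"]
attention = np.array([0.05, 0.06, 0.75, 0.07, 0.07])
gradients = np.array([0.02, 0.03, 0.88, 0.04, 0.03])

scores = anomaly_scores(attention, gradients)
flagged = [t for t, f in zip(tokens, flag_outliers(scores)) if f]
print(flagged)  # → ['cf']
```

In a real deployment the attention vector would come from the model's attention maps over the input tokens and the attributions from input-embedding gradients; the flagged tokens then serve as localized trigger candidates, which is what gives the defense its interpretability.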
Key Contributions
- Identifies and formalizes attention drift and gradient dominance as consistent internal signatures of backdoor triggers across encoder-based PLMs, not just BERT
- Proposes an inference-time anomaly scoring mechanism that fuses token-level attention weights and gradient attributions to detect poisoned inputs without modifying the model or requiring clean/poisoned data splits
- Provides an interpretability-driven analysis of the scoring mechanism that localizes trigger tokens and explains defense decisions
🛡️ Threat Analysis
The paper directly targets backdoor attacks on pre-trained language models: adversaries embed hidden trigger patterns in training data that cause targeted misclassifications when activated. The proposed defense detects these triggers at inference time via attention-gradient anomaly scoring. This is a textbook ML10 scenario: hidden, trigger-activated malicious behavior countered by a trigger-localization and detection defense.