Trigger Where It Hurts: Unveiling Hidden Backdoors through Sensitivity with Sensitron
Gejian Zhao , Hanzhou Wu , Xinpeng Zhang
Published on arXiv
2509.19101
Model Poisoning
OWASP ML Top 10 — ML10
Key Finding
Achieves 97.8% Attack Success Rate (+5.8% over SOTA) with 85.4% ASR at only 0.1% poisoning rate while evading multiple SOTA backdoor defenses.
Sensitron (DMSA + H-SHAP + Plug-and-Rank)
Novel technique introduced
Backdoor attacks pose a significant security threat to natural language processing (NLP) systems, but existing methods lack explainable trigger mechanisms and fail to quantitatively model vulnerability patterns. This work pioneers the quantitative connection between explainable artificial intelligence (XAI) and backdoor attacks, introducing Sensitron, a novel modular framework for crafting stealthy and robust backdoor triggers. Sensitron employs a progressive refinement approach where Dynamic Meta-Sensitivity Analysis (DMSA) first identifies potentially vulnerable input tokens, Hierarchical SHAP Estimation (H-SHAP) then provides explainable attribution to precisely pinpoint the most influential tokens, and finally a Plug-and-Rank mechanism that generates contextually appropriate triggers. We establish the first mathematical correlation (Sensitivity Ranking Correlation, SRC=0.83) between explainability scores and empirical attack success, enabling precise targeting of model vulnerabilities. Sensitron achieves 97.8% Attack Success Rate (ASR) (+5.8% over state-of-the-art (SOTA)) with 85.4% ASR at 0.1% poisoning rate, demonstrating robust resistance against multiple SOTA defenses. This work reveals fundamental NLP vulnerabilities and provides new attack vectors through weaponized explainability.
Key Contributions
- Sensitron framework integrating Dynamic Meta-Sensitivity Analysis (DMSA) and Hierarchical SHAP Estimation (H-SHAP) to identify the most vulnerable token positions for backdoor trigger placement
- First quantitative correlation (SRC=0.83) between explainability scores and empirical backdoor attack success, enabling principled vulnerability targeting
- Plug-and-Rank mechanism that generates contextually fluent multi-token triggers achieving 97.8% ASR (+5.8% over SOTA) and 85.4% ASR at an extremely low 0.1% poisoning rate
🛡️ Threat Analysis
Sensitron is explicitly a backdoor attack framework that embeds hidden triggers into NLP models during training, causing targeted misclassification only when triggers appear at inference time — the defining characteristic of ML10. The framework improves trigger stealthiness and effectiveness through XAI-guided token selection (DMSA + H-SHAP + Plug-and-Rank), achieving 97.8% ASR while maintaining normal behavior on clean inputs.