attack 2025

Trigger Where It Hurts: Unveiling Hidden Backdoors through Sensitivity with Sensitron

Gejian Zhao , Hanzhou Wu , Xinpeng Zhang

0 citations · 47 references · arXiv

α

Published on arXiv

2509.19101

Model Poisoning

OWASP ML Top 10 — ML10

Key Finding

Achieves 97.8% Attack Success Rate (+5.8% over SOTA) with 85.4% ASR at only 0.1% poisoning rate while evading multiple SOTA backdoor defenses.

Sensitron (DMSA + H-SHAP + Plug-and-Rank)

Novel technique introduced


Backdoor attacks pose a significant security threat to natural language processing (NLP) systems, but existing methods lack explainable trigger mechanisms and fail to quantitatively model vulnerability patterns. This work pioneers the quantitative connection between explainable artificial intelligence (XAI) and backdoor attacks, introducing Sensitron, a novel modular framework for crafting stealthy and robust backdoor triggers. Sensitron employs a progressive refinement approach where Dynamic Meta-Sensitivity Analysis (DMSA) first identifies potentially vulnerable input tokens, Hierarchical SHAP Estimation (H-SHAP) then provides explainable attribution to precisely pinpoint the most influential tokens, and finally a Plug-and-Rank mechanism that generates contextually appropriate triggers. We establish the first mathematical correlation (Sensitivity Ranking Correlation, SRC=0.83) between explainability scores and empirical attack success, enabling precise targeting of model vulnerabilities. Sensitron achieves 97.8% Attack Success Rate (ASR) (+5.8% over state-of-the-art (SOTA)) with 85.4% ASR at 0.1% poisoning rate, demonstrating robust resistance against multiple SOTA defenses. This work reveals fundamental NLP vulnerabilities and provides new attack vectors through weaponized explainability.


Key Contributions

  • Sensitron framework integrating Dynamic Meta-Sensitivity Analysis (DMSA) and Hierarchical SHAP Estimation (H-SHAP) to identify the most vulnerable token positions for backdoor trigger placement
  • First quantitative correlation (SRC=0.83) between explainability scores and empirical backdoor attack success, enabling principled vulnerability targeting
  • Plug-and-Rank mechanism that generates contextually fluent multi-token triggers achieving 97.8% ASR (+5.8% over SOTA) and 85.4% ASR at an extremely low 0.1% poisoning rate

🛡️ Threat Analysis

Model Poisoning

Sensitron is explicitly a backdoor attack framework that embeds hidden triggers into NLP models during training, causing targeted misclassification only when triggers appear at inference time — the defining characteristic of ML10. The framework improves trigger stealthiness and effectiveness through XAI-guided token selection (DMSA + H-SHAP + Plug-and-Rank), achieving 97.8% ASR while maintaining normal behavior on clean inputs.


Details

Domains
nlp
Model Types
transformerllm
Threat Tags
white_boxtraining_timetargeted
Applications
text classificationsentiment analysisnlp systemslanguage model fine-tuning