Trigger Where It Hurts: Unveiling Hidden Backdoors through Sensitivity with Sensitron

Backdoor attacks pose a significant security threat to natural language processing (NLP) systems, but existing methods lack explainable trigger mechanisms and fail to quantitatively model vulnerability patterns. This work pioneers the quantitative connection between explainable artificial intelligence (XAI) and backdoor attacks, introducing Sensitron, a novel modular framework for crafting stealthy and robust backdoor triggers. Sensitron employs a progressive refinement approach where Dynamic Meta-Sensitivity Analysis (DMSA) first identifies potentially vulnerable input tokens, Hierarchical SHAP Estimation (H-SHAP) then provides explainable attribution to precisely pinpoint the most influential tokens, and finally a Plug-and-Rank mechanism that generates contextually appropriate triggers. We establish the first mathematical correlation (Sensitivity Ranking Correlation, SRC=0.83) between explainability scores and empirical attack success, enabling precise targeting of model vulnerabilities. Sensitron achieves 97.8% Attack Success Rate (ASR) (+5.8% over state-of-the-art (SOTA)) with 85.4% ASR at 0.1% poisoning rate, demonstrating robust resistance against multiple SOTA defenses. This work reveals fundamental NLP vulnerabilities and provides new attack vectors through weaponized explainability.

Key Contributions

Sensitron framework integrating Dynamic Meta-Sensitivity Analysis (DMSA) and Hierarchical SHAP Estimation (H-SHAP) to identify the most vulnerable token positions for backdoor trigger placement
First quantitative correlation (SRC=0.83) between explainability scores and empirical backdoor attack success, enabling principled vulnerability targeting
Plug-and-Rank mechanism that generates contextually fluent multi-token triggers achieving 97.8% ASR (+5.8% over SOTA) and 85.4% ASR at an extremely low 0.1% poisoning rate

🛡️ Threat Analysis

Model Poisoning

Sensitron is explicitly a backdoor attack framework that embeds hidden triggers into NLP models during training, causing targeted misclassification only when triggers appear at inference time — the defining characteristic of ML10. The framework improves trigger stealthiness and effectiveness through XAI-guided token selection (DMSA + H-SHAP + Plug-and-Rank), achieving 97.8% ASR while maintaining normal behavior on clean inputs.

Details

Domains

nlp

Model Types

transformerllm

Threat Tags

white_boxtraining_timetargeted

Applications

2025 4 cit.

Model Poisoning

83%

Trigger Where It Hurts: Unveiling Hidden Backdoors through Sensitivity with Sensitron

Key Contributions

🛡️ Threat Analysis

Details

Similar Papers

Who Speaks for the Trigger? Dynamic Expert Routing in Backdoored Mixture-of-Experts Transformers

SASER: Stego attacks on open-source LLMs

Fewer Weights, More Problems: A Practical Attack on LLM Pruning

Adversarial Contrastive Learning for LLM Quantization Attacks

The Achilles' Heel of LLMs: How Altering a Handful of Neurons Can Cripple Language Abilities

TFL: Targeted Bit-Flip Attack on Large Language Model

Backdooring Bias in Large Language Models

SBFA: Single Sneaky Bit Flip Attack to Break Large Language Models