Guided Perturbation Sensitivity (GPS): Detecting Adversarial Text via Embedding Stability and Word Importance
Bryan E. Tuck, Rakesh M. Verma
Published on arXiv
arXiv:2508.11667
Input Manipulation Attack
OWASP ML Top 10 — ML01
Key Finding
Achieves over 85% detection accuracy across three datasets, three attack types, and two victim models, generalizing to unseen attacks and models without retraining
GPS (Guided Perturbation Sensitivity)
Novel technique introduced
Adversarial text attacks remain a persistent threat to transformer models, yet existing defenses are typically attack-specific or require costly model retraining, leaving a gap for attack-agnostic detection. We introduce Guided Perturbation Sensitivity (GPS), a detection framework that identifies adversarial examples by measuring how embedding representations change when important words are masked. GPS first ranks words using importance heuristics, then measures embedding sensitivity to masking top-k critical words, and processes the resulting patterns with a BiLSTM detector. Experiments show that adversarially perturbed words exhibit disproportionately high masking sensitivity compared to naturally important words. Across three datasets, three attack types, and two victim models, GPS achieves over 85% detection accuracy and demonstrates competitive performance compared to existing state-of-the-art methods, often at lower computational cost. Using Normalized Discounted Cumulative Gain (NDCG) to measure perturbation identification quality, we demonstrate that gradient-based ranking significantly outperforms attention, hybrid, and random selection approaches, with identification quality strongly correlating with detection performance for word-level attacks ($\rho = 0.65$). GPS generalizes to unseen datasets, attacks, and models without retraining, providing a practical solution for adversarial text detection.
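The pipeline the abstract describes, rank words by importance, mask the top-k, and record how much the embedding shifts, can be sketched on toy data. Everything below is illustrative: the token vectors and importance scores are random stand-ins for the victim transformer's embeddings and gradient magnitudes, and zeroing a row stands in for `[MASK]` substitution; it is not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins (assumptions, not GPS internals): random per-token
# vectors in place of transformer embeddings, random scores in place
# of gradient-based word importance.
tokens = ["the", "movie", "was", "absolutly", "terible"]
token_vecs = rng.normal(size=(len(tokens), 8))
importance = rng.random(len(tokens))

def sentence_embedding(vecs, masked=None):
    """Mean-pool token vectors, zeroing the masked position
    (a crude stand-in for replacing the word with [MASK])."""
    vecs = vecs.copy()
    if masked is not None:
        vecs[masked] = 0.0
    return vecs.mean(axis=0)

base = sentence_embedding(token_vecs)
top_k = np.argsort(importance)[::-1][:3]   # top-k most important words

# Sensitivity trace: embedding shift from masking each top-k word.
# GPS feeds a feature sequence like this to a BiLSTM detector, which
# classifies the input as adversarial or benign.
trace = [np.linalg.norm(base - sentence_embedding(token_vecs, masked=i))
         for i in top_k]
```

The key empirical signal is that for adversarial inputs, the masked words that were actually perturbed by the attack produce disproportionately large entries in this trace.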
Key Contributions
- GPS framework: ranks words by gradient-based importance, measures embedding sensitivity to top-k masking, and passes the resulting feature trace through a BiLSTM detector to classify adversarial vs. benign inputs
- Empirical finding that adversarially substituted words exhibit disproportionately high masking sensitivity vs. naturally important words, providing a signal exploitable without attack-specific knowledge
- Demonstration that NDCG-based perturbation identification quality strongly correlates with detection performance (ρ = 0.65), validating gradient-based ranking over attention-based and random baselines
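NDCG here scores how well an importance ranking surfaces the words the attack actually perturbed (relevance 1) ahead of unperturbed ones (relevance 0). A minimal sketch of the standard binary-relevance NDCG@k computation, with a hypothetical `ndcg_at_k` helper not taken from the paper:

```python
import numpy as np

def ndcg_at_k(relevance_in_ranked_order, k):
    """NDCG@k for binary relevance: 1 if the word at that rank was
    truly perturbed by the attack, 0 otherwise. Input is listed in
    the order the importance heuristic ranked the words."""
    rel = np.asarray(relevance_in_ranked_order, dtype=float)[:k]
    if rel.sum() == 0:
        return 0.0
    # Log-discount: credit for a hit decays with its rank position.
    discounts = 1.0 / np.log2(np.arange(2, rel.size + 2))
    dcg = float((rel * discounts).sum())
    # Ideal DCG: all perturbed words ranked first.
    ideal = np.sort(np.asarray(relevance_in_ranked_order, dtype=float))[::-1][:k]
    idcg = float((ideal * discounts[: ideal.size]).sum())
    return dcg / idcg
```

A ranking that places every perturbed word on top scores 1.0; the score decays toward 0 as perturbed words slip down the list, which is the axis along which the paper compares gradient, attention, hybrid, and random rankings.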
🛡️ Threat Analysis
GPS is a detection defense against adversarial text examples, i.e., word-substitution evasion attacks on transformer classifiers at inference time. The paper characterizes adversarial inputs via embedding instability under masking and trains a BiLSTM detector to identify them, directly targeting the adversarial-example threat.