Understanding Sensitivity of Differential Attention through the Lens of Adversarial Robustness
Tsubasa Takahashi 1, Shojiro Yamabe 1,2, Futa Waseda 1,3, Kento Sasaki 1,4
Published on arXiv
arXiv:2510.00517
Input Manipulation Attack
OWASP ML Top 10 — ML01
Key Finding
Differential Attention models are more adversarially vulnerable than their standard-attention counterparts: the subtractive structure encourages negative gradient alignment, which amplifies gradient norms. Shallow DA stacks, however, retain partial robustness to small-budget attacks via depth-dependent noise cancellation.
Fragile Principle
Novel technique introduced
Differential Attention (DA) has been proposed as a refinement to standard attention, suppressing redundant or noisy context through a subtractive structure and thereby reducing contextual hallucination. While this design sharpens task-relevant focus, we show that it also introduces a structural fragility under adversarial perturbations. Our theoretical analysis identifies negative gradient alignment, a configuration encouraged by DA's subtraction, as the key driver of sensitivity amplification, leading to increased gradient norms and elevated local Lipschitz constants. We empirically validate this Fragile Principle through systematic experiments on ViT/DiffViT and evaluations of pretrained CLIP/DiffCLIP, spanning five datasets in total. These results demonstrate higher attack success rates, frequent gradient opposition, and stronger local sensitivity compared to standard attention. Furthermore, depth-dependent experiments reveal a robustness crossover: stacking DA layers attenuates small perturbations via depth-dependent noise cancellation, though this protection fades under larger attack budgets. Overall, our findings uncover a fundamental trade-off: DA improves discriminative focus on clean inputs but increases adversarial vulnerability, underscoring the need to jointly design for selectivity and robustness in future attention mechanisms.
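The subtractive structure at the heart of DA can be made concrete with a short sketch. Below is a minimal single-head differential attention block in PyTorch, loosely following the DiffAttn formulation (two softmax attention maps, the second scaled by a learnable λ and subtracted from the first); the scalar λ and the projection layout are simplifying assumptions for illustration, not the authors' exact implementation.

```python
import torch
import torch.nn.functional as F
from torch import nn

class DiffAttention(nn.Module):
    """Minimal single-head differential attention sketch (illustrative).

    Computes (softmax(Q1 K1^T / sqrt(d)) - lambda * softmax(Q2 K2^T / sqrt(d))) V.
    The subtraction is the structural feature the paper identifies as
    encouraging negative gradient alignment between the two branches.
    """

    def __init__(self, dim: int, lambda_init: float = 0.8):
        super().__init__()
        # Two independent query/key projections, one shared value projection.
        self.wq = nn.Linear(dim, 2 * dim, bias=False)
        self.wk = nn.Linear(dim, 2 * dim, bias=False)
        self.wv = nn.Linear(dim, dim, bias=False)
        # Scalar learnable subtraction weight (a simplification of the
        # reparameterized lambda used in DIFF Transformer implementations).
        self.lam = nn.Parameter(torch.tensor(lambda_init))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim)
        q1, q2 = self.wq(x).chunk(2, dim=-1)
        k1, k2 = self.wk(x).chunk(2, dim=-1)
        v = self.wv(x)
        scale = x.shape[-1] ** -0.5
        a1 = F.softmax(q1 @ k1.transpose(-2, -1) * scale, dim=-1)
        a2 = F.softmax(q2 @ k2.transpose(-2, -1) * scale, dim=-1)
        # Subtractive attention map: sharpens focus on clean inputs, but
        # its input gradient is the *difference* of the branch gradients.
        return (a1 - self.lam * a2) @ v

# Usage: attn = DiffAttention(dim=64); y = attn(torch.randn(2, 16, 64))
```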
Key Contributions
- Theoretical identification of 'negative gradient alignment' in Differential Attention as the structural cause of adversarial sensitivity amplification, formalized through local Lipschitz constant analysis (see the derivation sketched after this list)
- Empirical validation of the 'Fragile Principle' showing DiffViT and DiffCLIP exhibit higher adversarial attack success rates and gradient norms than standard attention counterparts across five datasets
- Discovery of a depth-dependent robustness crossover: stacking DA layers attenuates small perturbations via noise cancellation, but this protection collapses under larger attack budgets
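As a back-of-the-envelope illustration of the first contribution (the symbols g_1, g_2, λ, and f below are our shorthand, not the paper's notation): write the input gradient of a DA output as the difference of the gradients flowing through the two softmax branches.

```latex
\nabla_x f = g_1 - \lambda\, g_2,
\qquad
\|\nabla_x f\|^2 = \|g_1\|^2 - 2\lambda \langle g_1, g_2 \rangle + \lambda^2 \|g_2\|^2 .
```

When λ > 0 and the branches are negatively aligned (⟨g_1, g_2⟩ < 0), the cross term is positive, so the squared norm strictly exceeds ‖g_1‖² + λ²‖g_2‖²; taking a supremum over inputs then inflates the local Lipschitz estimate, which is the amplification mechanism the paper formalizes.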
🛡️ Threat Analysis
The paper characterizes the adversarial input-perturbation vulnerability of DA-based models (DiffViT, DiffCLIP), measuring higher attack success rates, increased gradient norms, and elevated local Lipschitz constants relative to standard-attention baselines. In doing so, it directly analyzes the adversarial-example threat surface created by the subtractive attention design.
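A minimal sketch of the kind of measurement described above, assuming a generic PyTorch image classifier; the one-step FGSM attack and the input-gradient-norm probe are standard stand-ins, not the paper's exact evaluation protocol.

```python
import torch
import torch.nn.functional as F

def fgsm_success_and_grad_norm(model, x, y, eps: float = 8 / 255):
    """Return (attack success rate, mean input-gradient norm) for one batch.

    The per-sample input-gradient norm is a cheap proxy for the local
    Lipschitz sensitivity the paper analyzes: a larger norm means a
    small perturbation can move the loss (and the prediction) further.
    """
    model.eval()
    x = x.clone().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    (grad,) = torch.autograd.grad(loss, x)

    # One-step FGSM perturbation inside an L-infinity ball of radius eps.
    x_adv = (x + eps * grad.sign()).clamp(0.0, 1.0).detach()
    with torch.no_grad():
        flipped = model(x_adv).argmax(dim=-1) != y
    grad_norm = grad.flatten(1).norm(dim=1).mean()
    return flipped.float().mean().item(), grad_norm.item()
```

Running this probe on matched ViT/DiffViT (or CLIP/DiffCLIP) pairs would reproduce the shape of the comparison reported in the paper: under the Fragile Principle, the DA variant is expected to show both a higher success rate and a larger gradient norm.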