Backdoor Directions in Vision Transformers
Sengim Karayalcin, Marina Krcek, Pin-Yu Chen, Stjepan Picek
Published on arXiv
2603.10806
Model Poisoning
OWASP ML Top 10 — ML10
Key Finding
A single linear 'trigger direction' in ViT activations causally mediates backdoor behavior, and removing it from model weights suppresses backdoor misclassification while enabling weight-only detection of stealthy attacks.
Backdoor Direction (BD Direction)
Novel technique introduced
This paper investigates how backdoor attacks are represented within Vision Transformers (ViTs). Assuming knowledge of the trigger, we identify a specific "trigger direction" in the model's activations that corresponds to the internal representation of the trigger. We confirm the causal role of this linear direction by showing that interventions in both activation and parameter space consistently modulate the model's backdoor behavior across multiple datasets and attack types. Using this direction as a diagnostic tool, we trace how backdoor features are processed across layers. Our analysis reveals distinct qualitative differences: static-patch triggers follow a different internal logic than stealthy, distributed triggers. We further examine the link between backdoors and adversarial attacks, specifically testing whether PGD-based perturbations (de-)activate the identified trigger mechanism. Finally, we propose a data-free, weight-based detection scheme for stealthy-trigger attacks. Our findings show that mechanistic interpretability offers a robust framework for diagnosing and addressing security vulnerabilities in computer vision.
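The idea of a linear trigger direction and an activation-space intervention can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes a simple difference-of-means estimator over residual-stream activations at one layer (the paper's exact estimator may differ), and all function names are hypothetical.

```python
import numpy as np

def trigger_direction(clean_acts, triggered_acts):
    """Estimate a linear 'trigger direction' in activation space.

    Hypothetical difference-of-means estimator: the normalized gap
    between mean activations on triggered vs. clean inputs at a
    chosen layer. Both inputs have shape (n_samples, d_model).
    """
    d = triggered_acts.mean(axis=0) - clean_acts.mean(axis=0)
    return d / np.linalg.norm(d)

def ablate_direction(acts, d):
    """Activation-space intervention: remove the component of each
    activation along the unit direction d, leaving the orthogonal
    complement untouched."""
    return acts - np.outer(acts @ d, d)
```

After ablation, every activation has zero projection onto the estimated direction, which is the sense in which such an intervention can "switch off" the trigger's internal representation if the direction is causally implicated.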
Key Contributions
- Derives a 'backdoor direction' in ViT residual stream activations using trigger knowledge, and shows that orthogonalizing this direction out of weight matrices removes backdoor behavior across multiple datasets and attack types.
- Reveals qualitative differences in internal representations between static-patch triggers and stealthy distributed triggers via layer-wise propagation analysis.
- Proposes a data-free, weight-based detection scheme for stealthy-trigger backdoor attacks in ViTs, extending prior convolutional-model approaches to the transformer setting.
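The parameter-space counterpart of the first contribution, orthogonalizing the trigger direction out of weight matrices, can be sketched like this. A hedged illustration under one assumption: for a weight matrix whose output is written into the residual stream, projecting its rows off the unit direction d guarantees the layer can no longer write any component along d. The function name is hypothetical.

```python
import numpy as np

def orthogonalize_weights(W, d):
    """Project the trigger direction d out of a weight matrix W of shape
    (d_model, d_in), so that every output W @ x is orthogonal to d.

    Equivalent to left-multiplying by (I - d d^T) for unit d:
    only the rank-1 component of W along d is removed.
    """
    d = d / np.linalg.norm(d)
    return W - np.outer(d, d @ W)
```

Because the edit is a rank-1 update to the weights themselves, it needs no inference-time hooks and no data, which is what makes a purely weight-based defense of this kind possible.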
🛡️ Threat Analysis
The paper directly investigates backdoor/trojan attacks in Vision Transformers: it identifies causal 'trigger directions' in model activations, shows that removing these directions from weights mitigates backdoor behavior, traces how backdoor features propagate across layers, and proposes a data-free weight-based detection scheme — all core ML10 contributions.