Backdoor Directions in Vision Transformers
Sengim Karayalcin, Marina Krcek, Pin-Yu Chen, Stjepan Picek
Published on arXiv
2603.10806
Model Poisoning
OWASP ML Top 10 — ML10
Key Finding
A single linear 'trigger direction' in ViT activations causally mediates backdoor behavior, and removing it from model weights suppresses backdoor misclassification while enabling weight-only detection of stealthy attacks.
Backdoor Direction (BD Direction)
Novel technique introduced
This paper investigates how backdoor attacks are represented within Vision Transformers (ViTs). Assuming knowledge of the trigger, we identify a specific "trigger direction" in the model's activations that corresponds to the internal representation of the trigger. We confirm the causal role of this linear direction by showing that interventions in both activation and parameter space consistently modulate the model's backdoor behavior across multiple datasets and attack types. Using this direction as a diagnostic tool, we trace how backdoor features are processed across layers. Our analysis reveals distinct qualitative differences: static-patch triggers follow a different internal logic than stealthy, distributed triggers. We further examine the link between backdoors and adversarial attacks, specifically testing whether PGD-based perturbations (de-)activate the identified trigger mechanism. Finally, we propose a data-free, weight-based detection scheme for stealthy-trigger attacks. Our findings show that mechanistic interpretability offers a robust framework for diagnosing and addressing security vulnerabilities in computer vision.
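The idea of a linear trigger direction and an activation-space intervention can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes a simple difference-of-means estimator over residual-stream activations at one layer (the paper's exact estimator may differ), and all function names are hypothetical.

```python
import numpy as np

def trigger_direction(clean_acts, triggered_acts):
    """Estimate a linear 'trigger direction' in activation space.

    Hypothetical difference-of-means estimator: the normalized gap
    between mean activations on triggered vs. clean inputs at a
    chosen layer. Both inputs have shape (n_samples, d_model).
    """
    d = triggered_acts.mean(axis=0) - clean_acts.mean(axis=0)
    return d / np.linalg.norm(d)

def ablate_direction(acts, d):
    """Activation-space intervention: remove the component of each
    activation along the unit direction d, leaving the orthogonal
    complement untouched."""
    return acts - np.outer(acts @ d, d)
```

After ablation, every activation has zero projection onto the estimated direction, which is the sense in which such an intervention can "switch off" the trigger's internal representation if the direction is causally implicated.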
Key Contributions
- Derives a 'backdoor direction' in ViT residual stream activations using trigger knowledge, and shows that orthogonalizing this direction out of weight matrices removes backdoor behavior across multiple datasets and attack types.
- Reveals qualitative differences in internal representations between static-patch triggers and stealthy distributed triggers via layer-wise propagation analysis.
- Proposes a data-free, weight-based detection scheme for stealthy-trigger backdoor attacks in ViTs, extending prior convolutional-model approaches to the transformer setting.
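The parameter-space counterpart of the first contribution, orthogonalizing the trigger direction out of weight matrices, can be sketched like this. A hedged illustration under one assumption: for a weight matrix whose output is written into the residual stream, projecting its rows off the unit direction d guarantees the layer can no longer write any component along d. The function name is hypothetical.

```python
import numpy as np

def orthogonalize_weights(W, d):
    """Project the trigger direction d out of a weight matrix W of shape
    (d_model, d_in), so that every output W @ x is orthogonal to d.

    Equivalent to left-multiplying by (I - d d^T) for unit d:
    only the rank-1 component of W along d is removed.
    """
    d = d / np.linalg.norm(d)
    return W - np.outer(d, d @ W)
```

Because the edit is a rank-1 update to the weights themselves, it needs no inference-time hooks and no data, which is what makes a purely weight-based defense of this kind possible.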
🛡️ Threat Analysis
The paper directly investigates backdoor/trojan attacks in Vision Transformers: it identifies causal 'trigger directions' in model activations, shows that removing these directions from weights mitigates backdoor behavior, traces how backdoor features propagate across layers, and proposes a data-free weight-based detection scheme — all core ML10 contributions.