Test-Time Attention Purification for Backdoored Large Vision Language Models
Zhifang Zhang, Bojun Yang, Shuo He, Weitong Chen, Wei Emma Zhang, Olaf Maennel, Lei Feng, Miao Xu
Published on arXiv
2603.12989
Model Poisoning
OWASP ML Top 10 — ML10
Key Finding
Attention perturbation rapidly suppresses backdoor attack success rate in LVLMs, while pixel perturbation barely reduces it, demonstrating that backdoors operate through attention mechanisms rather than low-level visual patterns.
CleanSight
Novel technique introduced
Despite their strong multimodal performance, large vision-language models (LVLMs) are vulnerable during fine-tuning to backdoor attacks, where adversaries insert trigger-embedded samples into the training data to implant behaviors that can be maliciously activated at test time. Existing defenses typically rely on retraining backdoored parameters (e.g., adapters or LoRA modules) with clean data, which is computationally expensive and often degrades model performance. In this work, we provide a new mechanistic understanding of backdoor behaviors in LVLMs: the trigger does not influence prediction through low-level visual patterns, but through abnormal cross-modal attention redistribution, where trigger-bearing visual tokens steal attention away from the textual context, a phenomenon we term attention stealing. Motivated by this, we propose CleanSight, a training-free, plug-and-play defense that operates purely at test time. CleanSight (i) detects poisoned inputs based on the relative visual-text attention ratio in selected cross-modal fusion layers, and (ii) purifies the input by selectively pruning the suspicious high-attention visual tokens to neutralize the backdoor activation. Extensive experiments show that CleanSight significantly outperforms existing pixel-based purification defenses across diverse datasets and backdoor attack types, while preserving the model's utility on both clean and poisoned samples.
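The detection step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the head-averaged attention matrix shape, and the fixed threshold are all assumptions; the paper selects specific cross-modal fusion layers and calibrates its detection rule, which this sketch simplifies.

```python
import numpy as np

def visual_text_attention_ratio(attn, visual_idx, text_idx):
    """Relative visual-to-text attention ratio for one fusion layer.

    attn: (num_queries, num_keys) attention weights, averaged over heads.
    visual_idx / text_idx: key positions of visual and textual tokens.
    (Shapes and index conventions are illustrative.)
    """
    visual_mass = attn[:, visual_idx].sum(axis=1).mean()
    text_mass = attn[:, text_idx].sum(axis=1).mean()
    return visual_mass / (text_mass + 1e-8)

def is_poisoned(attn, visual_idx, text_idx, threshold=2.0):
    # Hypothetical threshold: on clean inputs the ratio stays near a
    # dataset-level baseline; "attention stealing" inflates it sharply
    # as trigger-bearing visual tokens absorb attention from the text.
    return visual_text_attention_ratio(attn, visual_idx, text_idx) > threshold
```

On a clean input with attention spread evenly across visual and text tokens the ratio stays near 1; a trigger that concentrates attention mass on a few visual tokens pushes it well above the threshold.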
Key Contributions
- Mechanistic discovery that backdoors in LVLMs operate through cross-modal attention stealing rather than low-level visual patterns
- CleanSight: first training-free, test-time defense for backdoored LVLMs using attention-based detection and token pruning
- Significantly outperforms pixel-based purification defenses while preserving model utility on clean samples
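The purification side of CleanSight prunes the visual tokens receiving suspiciously high attention. A minimal sketch of that idea, under assumed shapes: the "drop the top-k most-attended tokens" rule and the `k` hyperparameter are simplifications of the paper's selective pruning, not its exact criterion.

```python
import numpy as np

def prune_high_attention_tokens(visual_tokens, attn_to_visual, k=1):
    """Drop the k visual tokens that receive the most attention.

    visual_tokens: (num_visual, dim) visual token embeddings.
    attn_to_visual: (num_visual,) mean attention each visual token receives.
    Returns the pruned token array with original ordering preserved.
    (Illustrative selection rule; the paper's pruning is more selective.)
    """
    # argsort ascending, then drop the last k entries (the most attended)
    keep = np.sort(np.argsort(attn_to_visual)[:-k])
    return visual_tokens[keep]
```

Because the backdoor fires through the stolen attention rather than through pixels, removing these few tokens neutralizes the trigger while leaving the rest of the visual context intact.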
🛡️ Threat Analysis
Paper directly addresses backdoor/trojan attacks in LVLMs where triggers embedded during fine-tuning cause malicious behavior at test time — core ML10 threat.