Seeing No Evil: Blinding Large Vision-Language Models to Safety Instructions via Adversarial Attention Hijacking
Jingru Li 1, Wei Ren 1, Tianqing Zhu 2
Published on arXiv
2604.10299
Input Manipulation Attack
OWASP ML Top 10 — ML01
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
Achieves 94.4% attack success rate on Qwen-VL with 40% fewer iterations than baselines; suppresses system-prompt attention by 80%
Attention-Guided Visual Jailbreaking
Novel technique introduced
Large Vision-Language Models (LVLMs) rely on attention-based retrieval of safety instructions to maintain alignment during generation. Existing attacks typically optimize image perturbations to maximize the likelihood of harmful outputs, but converge slowly due to gradient conflict between the adversarial objective and the model's safety-retrieval mechanism. We propose Attention-Guided Visual Jailbreaking, which circumvents rather than overpowers safety alignment by directly manipulating attention patterns. Our method introduces two auxiliary objectives: (1) suppressing attention to alignment-relevant prefix tokens and (2) anchoring generation on adversarial image features. This push-pull formulation reduces gradient conflict by 45% and achieves a 94.4% attack success rate on Qwen-VL (vs. 68.8% baseline) with 40% fewer iterations. At tighter perturbation budgets ($\epsilon = 8/255$), we maintain 59.0% ASR compared to 45.7% for standard methods. Mechanistic analysis reveals a failure mode we term safety blindness: successful attacks suppress system-prompt attention by 80%, causing models to generate harmful content not by overriding safety rules, but by failing to retrieve them.
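The push-pull formulation described above can be sketched as a single auxiliary loss: penalize the attention mass the model places on safety-prefix tokens (push) and reward the mass placed on adversarial image tokens (pull). This is a minimal illustrative sketch, not the paper's implementation; the function name `push_pull_loss`, the weighting scheme, and the averaging over query positions are assumptions.

```python
import numpy as np

def push_pull_loss(attn, safety_idx, image_idx,
                   lam_suppress=1.0, lam_anchor=1.0):
    """Hypothetical auxiliary loss combining the two objectives.

    attn       : (num_queries, num_keys) attention matrix, rows sum to 1
    safety_idx : key indices of the alignment-relevant prefix tokens
    image_idx  : key indices of the adversarial image tokens
    """
    # mean attention mass retrieved from the safety prefix (to suppress)
    safety_mass = attn[:, safety_idx].sum(axis=1).mean()
    # mean attention mass anchored on image features (to encourage)
    image_mass = attn[:, image_idx].sum(axis=1).mean()
    # minimizing this loss pushes attention off the prefix, pulls it to the image
    return lam_suppress * safety_mass - lam_anchor * image_mass

# Example: two query positions over keys [safety, text, image]
attn = np.array([[0.6, 0.1, 0.3],
                 [0.5, 0.2, 0.3]])
print(push_pull_loss(attn, safety_idx=[0], image_idx=[2]))  # 0.25
```

In practice this term would be added to the harmful-output likelihood objective and differentiated through the model; separating the two attention targets is what the abstract credits with reducing gradient conflict.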
Key Contributions
- Attention-guided adversarial attack that suppresses safety-instruction retrieval rather than overpowering it
- Dual-objective formulation (suppress alignment-token attention + anchor on adversarial features) reducing gradient conflict by 45%
- Identification of a "safety blindness" failure mode, in which models fail to retrieve safety rules rather than override them
🛡️ Threat Analysis
Gradient-based adversarial image perturbations that induce misaligned or harmful outputs at inference time by manipulating the model's internal attention patterns.
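For context, a gradient-based image perturbation under the $\epsilon = 8/255$ budget cited above is typically computed with a projected-gradient loop of the following shape. This is a generic $L_\infty$ PGD sketch, not the paper's optimizer; `pgd_perturb`, the step size, and the iteration count are illustrative assumptions, and `grad_fn` stands in for backpropagation through the victim model's attack objective.

```python
import numpy as np

def pgd_perturb(image, grad_fn, eps=8/255, step=2/255, iters=10):
    """Minimal L-infinity PGD-style perturbation loop (illustrative only).

    image   : array of pixel values in [0, 1]
    grad_fn : returns the gradient of the attack loss w.r.t. the image
    """
    x = image.copy()
    for _ in range(iters):
        g = grad_fn(x)
        x = x - step * np.sign(g)                  # signed descent step
        x = np.clip(x, image - eps, image + eps)   # project into eps-ball
        x = np.clip(x, 0.0, 1.0)                   # keep valid pixel range
    return x

# Toy usage: a constant-gradient objective drives pixels to the budget edge
img = np.full(4, 0.5)
adv = pgd_perturb(img, grad_fn=lambda x: np.ones_like(x))
print(np.abs(adv - img).max() <= 8/255 + 1e-9)  # True
```

In the attack summarized here, the loss being descended would include the attention-suppression and image-anchoring terms rather than output likelihood alone.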