Defense · 2025

Spot Risks Before Speaking! Unraveling Safety Attention Heads in Large Vision-Language Models

Ziwei Zheng, Junyao Zhao, Le Yang, Lijun He, Fan Li



Published on arXiv: 2501.02029

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

A logistic regression detector built on safety attention head activations achieves strong zero-shot generalization against various prompt-based jailbreaking attacks on LVLMs.

Safety Attention Heads (SAHs)

Novel technique introduced


With the integration of an additional modality, large vision-language models (LVLMs) exhibit greater vulnerability to safety risks (e.g., jailbreaking) compared to their language-only predecessors. Although recent studies have devoted considerable effort to the post-hoc alignment of LVLMs, the inner safety mechanisms remain largely unexplored. In this paper, we discover that internal activations of LVLMs during the first token generation can effectively identify malicious prompts across different attacks. This inherent safety perception is governed by sparse attention heads, which we term "safety heads." Further analysis reveals that these heads act as specialized shields against malicious prompts; ablating them leads to higher attack success rates, while the model's utility remains unaffected. By locating these safety heads and concatenating their activations, we construct a straightforward but powerful malicious prompt detector that integrates seamlessly into the generation process with minimal extra inference overhead. Despite its simple structure of a logistic regression model, the detector surprisingly exhibits strong zero-shot generalization capabilities. Experiments across various prompt-based attacks confirm the effectiveness of leveraging safety heads to protect LVLMs. Code is available at https://github.com/Ziwei-Zheng/SAHs.
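The head-locating step described above can be illustrated with a minimal sketch: score each attention head by how well a linear probe on its first-token activation alone separates benign from malicious prompts, then keep the top scorers. The activations, head count, dimensions, and the "planted" signal heads below are all synthetic stand-ins for illustration; in the paper's setting the features would be read off the LVLM during first-token generation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Synthetic stand-in: first-token activations for 8 attention heads,
# each of dimension 16, over 200 benign and 200 malicious prompts.
n_heads, head_dim, n = 8, 16, 200
benign = rng.normal(size=(n, n_heads, head_dim))
malicious = rng.normal(size=(n, n_heads, head_dim))
# Pretend heads 2 and 5 are "safety heads": their activations shift
# on malicious prompts, while the other heads carry no signal.
for h in (2, 5):
    malicious[:, h, :] += 1.5

X = np.concatenate([benign, malicious])        # (400, 8, 16)
y = np.concatenate([np.zeros(n), np.ones(n)])  # 0 = benign, 1 = malicious

# Score each head: cross-validated accuracy of a probe on that head alone.
scores = [
    cross_val_score(LogisticRegression(max_iter=1000), X[:, h, :], y, cv=5).mean()
    for h in range(n_heads)
]

# The highest-scoring heads are the candidate safety heads.
top_heads = sorted(np.argsort(scores)[::-1][:2].tolist())
print(top_heads)  # recovers the planted heads, [2, 5]
```

Because the signal heads separate the two classes almost perfectly while the rest sit near chance accuracy, the ranking isolates them reliably; the same per-head probing idea scales to real layer/head grids.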


Key Contributions

  • Discovery of 'safety heads' — sparse attention heads whose activations during first token generation reliably identify malicious prompts across diverse jailbreaking attacks
  • A lightweight malicious prompt detector (logistic regression on concatenated safety head activations) that integrates into generation with minimal overhead
  • Empirical evidence that ablating safety heads raises attack success rates without degrading model utility, confirming their specialized protective role
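The detector from the second bullet can be sketched as follows: concatenate the activations of the located safety heads at the first generated token and fit a logistic regression on them. Everything here is synthetic for illustration; the head indices, the `fake_first_token_activations` helper, and the injected shift are fabricated stand-ins for real LVLM forward-hook features.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)

# Hypothetical safety-head indices, assumed already located.
safety_heads = [2, 5]
n_heads, head_dim, n = 8, 16, 300

def fake_first_token_activations(count, is_malicious):
    """Synthetic stand-in for per-head first-token activations."""
    acts = rng.normal(size=(count, n_heads, head_dim))
    if is_malicious:
        acts[:, safety_heads, :] += 1.5  # safety heads react to attacks
    return acts

X_train = np.concatenate([fake_first_token_activations(n, False),
                          fake_first_token_activations(n, True)])
y_train = np.concatenate([np.zeros(n), np.ones(n)])

# Features: concatenated safety-head activations only.
def feats(X):
    return X[:, safety_heads, :].reshape(len(X), -1)

detector = LogisticRegression(max_iter=1000).fit(feats(X_train), y_train)

# At inference these features are available as soon as the first token is
# generated, so flagging a prompt adds negligible overhead.
X_test = np.concatenate([fake_first_token_activations(50, False),
                         fake_first_token_activations(50, True)])
preds = detector.predict(feats(X_test))
benign_acc = (preds[:50] == 0).mean()
malicious_acc = (preds[50:] == 1).mean()
```

The detector is just a linear readout over a few heads' activations, which is why it is cheap enough to run on every prompt during generation.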

🛡️ Threat Analysis


Details

Domains
multimodal, vision, nlp
Model Types
vlm, transformer
Threat Tags
inference_time, black_box
Applications
large vision-language models, jailbreak detection, lvlm safety