
Depth Charge: Jailbreak Large Language Models from Deep Safety Attention Heads

Jinman Wu 1, Yi Xie 2, Shiqian Zhao 3, Xiaofeng Chen 1


Published on arXiv

arXiv:2603.05772

Input Manipulation Attack

OWASP ML Top 10 — ML01

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

SAHA achieves a 14% higher attack success rate than SOTA jailbreak baselines on open-source LLMs by exploiting deeper, insufficiently aligned attention heads.

SAHA (Safety Attention Head Attack)

Novel technique introduced


Open-source large language models (OSLLMs) have demonstrated remarkable generative performance. However, because their architectures and weights are public, they remain exposed to jailbreak attacks even after safety alignment. Existing attacks operate primarily at shallow levels, such as the prompt or embedding level, and often fail to expose vulnerabilities rooted in deeper model components, creating a false sense of security about current defenses. In this paper, we propose the Safety Attention Head Attack (SAHA), an attention-head-level jailbreak framework that exploits vulnerabilities in deeper but insufficiently aligned attention heads. SAHA contains two novel designs. First, we reveal that deeper attention layers are more vulnerable to jailbreak attacks; based on this finding, SAHA introduces an Ablation-Impact Ranking head-selection strategy to locate the layers most critical to unsafe output. Second, we introduce a boundary-aware perturbation method, Layer-Wise Perturbation, which elicits unsafe content with minimal perturbation to the attention. This constrained perturbation preserves semantic relevance to the target intent while ensuring evasion. Extensive experiments show the superiority of our method: SAHA improves attack success rate (ASR) by 14% over SOTA baselines, revealing the attention head as a vulnerable attack surface. Our code is available at https://anonymous.4open.science/r/SAHA.


Key Contributions

  • Ablation-Impact Ranking: a head-selection strategy that identifies safety-critical attention heads concentrated in deeper layers as the primary vulnerability surface.
  • Layer-Wise Perturbation: a boundary-aware perturbation method that minimally modifies attention head activations to maximally elicit unsafe content while preserving semantic coherence.
  • Empirical finding that deeper attention layers are systematically less robustly aligned, enabling a 14% ASR improvement over state-of-the-art jailbreak baselines.
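The entry does not reproduce the paper's algorithm, but the core idea behind Ablation-Impact Ranking can be sketched in a few lines: zero-ablate each attention head in turn, measure how much a safety-related score shifts, and rank heads by that impact. The array shapes, the zero-ablation choice, and the scalar `score_fn` below are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def ablation_impact_ranking(head_outputs, score_fn):
    """Rank attention heads by how much zeroing each one changes a
    safety-related score.

    head_outputs: array of shape (n_heads, d), a stand-in for each
        head's contribution to the layer output.
    score_fn: maps the summed head contribution to a scalar score
        (a placeholder for, e.g., an unsafe-output probability).
    """
    baseline = score_fn(head_outputs.sum(axis=0))
    impacts = []
    for h in range(head_outputs.shape[0]):
        ablated = head_outputs.copy()
        ablated[h] = 0.0  # zero-ablate head h only
        impacts.append(abs(score_fn(ablated.sum(axis=0)) - baseline))
    # Highest-impact heads first: candidates for safety-critical heads.
    return sorted(range(len(impacts)), key=lambda h: -impacts[h])
```

In this toy form, a head whose removal moves the score the most is ranked first; the paper's finding is that such heads concentrate in deeper layers.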

🛡️ Threat Analysis

Input Manipulation Attack

Uses optimization-based gradient perturbations applied to internal attention-head activations at inference time. The attack is analogous to adversarial-example attacks, but operates on model internals rather than the input surface, with the goal of eliciting unsafe or misaligned output.
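A minimal sketch of such a boundary-aware internal perturbation: step an attention activation along an adversarial gradient, then project the cumulative change back into a small L-infinity ball so the modification stays minimal. The FGSM-style sign step, `step`, and `eps` are illustrative placeholders, not the Layer-Wise Perturbation method as published.

```python
import numpy as np

def bounded_perturbation(activation, grad, step=0.1, eps=0.05):
    """One constrained update on an attention activation.

    activation: the head's original activation vector.
    grad: gradient of an adversarial objective w.r.t. the activation
        (assumed precomputed; a placeholder here).
    The perturbation is clipped into an L-infinity ball of radius
    `eps` around the original activation, keeping the change minimal.
    """
    perturbed = activation + step * np.sign(grad)  # FGSM-style step
    delta = np.clip(perturbed - activation, -eps, eps)
    return activation + delta
```

Keeping the perturbation inside a tight norm ball is what lets an attack of this kind preserve semantic relevance to the target intent while still steering generation.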


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
white_box, inference_time, targeted
Applications
open-source llm safety alignment, llm jailbreaking