
Depth Charge: Jailbreak Large Language Models from Deep Safety Attention Heads

Jinman Wu 1, Yi Xie 2, Shiqian Zhao 3, Xiaofeng Chen 1


Published on arXiv

arXiv:2603.05772

Input Manipulation Attack

OWASP ML Top 10 — ML01

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

SAHA achieves a 14% higher attack success rate than SOTA jailbreak baselines on open-source LLMs by exploiting deeper, insufficiently aligned attention heads.

SAHA (Safety Attention Head Attack)

Novel technique introduced


Open-source large language models (OSLLMs) have demonstrated remarkable generative performance. However, because their architectures and weights are public, they remain exposed to jailbreak attacks even after safety alignment. Existing attacks operate primarily at shallow levels, such as the prompt or embedding level, and often fail to expose vulnerabilities rooted in deeper model components, creating a false sense of security about current defenses. In this paper, we propose the Safety Attention Head Attack (SAHA), an attention-head-level jailbreak framework that exploits vulnerabilities in deeper but insufficiently aligned attention heads. SAHA contains two novel designs. First, we reveal that deeper attention layers are more vulnerable to jailbreak attacks; based on this finding, SAHA introduces an Ablation-Impact Ranking head-selection strategy to locate the layers most critical to unsafe output. Second, we introduce a boundary-aware perturbation method, Layer-Wise Perturbation, which elicits unsafe content with minimal perturbation to the attention. This constrained perturbation preserves semantic relevance to the target intent while ensuring evasion. Extensive experiments show the superiority of our method: SAHA improves attack success rate (ASR) by 14% over SOTA baselines, revealing the attention head as a vulnerable attack surface. Our code is available at https://anonymous.4open.science/r/SAHA.


Key Contributions

  • Ablation-Impact Ranking: a head-selection strategy that identifies safety-critical attention heads concentrated in deeper layers as the primary vulnerability surface.
  • Layer-Wise Perturbation: a boundary-aware perturbation method that minimally modifies attention head activations to maximally elicit unsafe content while preserving semantic coherence.
  • Empirical finding that deeper attention layers are systematically less robustly aligned, enabling a 14% ASR improvement over state-of-the-art jailbreak baselines.
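The entry does not reproduce the paper's algorithm, but the core idea behind Ablation-Impact Ranking can be sketched in a few lines: zero-ablate each attention head in turn, measure how much a safety-related score shifts, and rank heads by that impact. The array shapes, the zero-ablation choice, and the scalar `score_fn` below are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def ablation_impact_ranking(head_outputs, score_fn):
    """Rank attention heads by how much zeroing each one changes a
    safety-related score.

    head_outputs: array of shape (n_heads, d), a stand-in for each
        head's contribution to the layer output.
    score_fn: maps the summed head contribution to a scalar score
        (a placeholder for, e.g., an unsafe-output probability).
    """
    baseline = score_fn(head_outputs.sum(axis=0))
    impacts = []
    for h in range(head_outputs.shape[0]):
        ablated = head_outputs.copy()
        ablated[h] = 0.0  # zero-ablate head h only
        impacts.append(abs(score_fn(ablated.sum(axis=0)) - baseline))
    # Highest-impact heads first: candidates for safety-critical heads.
    return sorted(range(len(impacts)), key=lambda h: -impacts[h])
```

In this toy form, a head whose removal moves the score the most is ranked first; the paper's finding is that such heads concentrate in deeper layers.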

🛡️ Threat Analysis

Input Manipulation Attack

Uses optimization-based gradient perturbations applied to internal attention-head activations at inference time. The attack is analogous to adversarial-example attacks, but operates on model internals rather than the input surface, with the goal of eliciting unsafe or misaligned output.
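A minimal sketch of such a boundary-aware internal perturbation: step an attention activation along an adversarial gradient, then project the cumulative change back into a small L-infinity ball so the modification stays minimal. The FGSM-style sign step, `step`, and `eps` are illustrative placeholders, not the Layer-Wise Perturbation method as published.

```python
import numpy as np

def bounded_perturbation(activation, grad, step=0.1, eps=0.05):
    """One constrained update on an attention activation.

    activation: the head's original activation vector.
    grad: gradient of an adversarial objective w.r.t. the activation
        (assumed precomputed; a placeholder here).
    The perturbation is clipped into an L-infinity ball of radius
    `eps` around the original activation, keeping the change minimal.
    """
    perturbed = activation + step * np.sign(grad)  # FGSM-style step
    delta = np.clip(perturbed - activation, -eps, eps)
    return activation + delta
```

Keeping the perturbation inside a tight norm ball is what lets an attack of this kind preserve semantic relevance to the target intent while still steering generation.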


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
white_box, inference_time, targeted
Applications
open-source llm safety alignment, llm jailbreaking