defense 2025

Safety Alignment Should Be Made More Than Just A Few Attention Heads

Chao Huang 1,2, Zefeng Zhang 1,2, Juewei Yue 1,2, Quangang Li 1,2, Chuang Zhang 1,2, Tingwen Liu 1,2


Published on arXiv: 2508.19697

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

Models trained with AHD distribute safety representations across significantly more attention heads and exhibit considerably stronger robustness against mainstream jailbreak attacks while maintaining functional utility.

AHD (Attention Head-level Dropout)

Novel technique introduced


Current safety alignment for large language models (LLMs) continues to present vulnerabilities, as adversarial prompting can effectively bypass safety measures. Our investigation shows that these safety mechanisms predominantly depend on a limited subset of attention heads: removing or ablating these heads can severely compromise model safety. To identify and evaluate these safety-critical components, we introduce RDSHA, a targeted ablation method that leverages the model's refusal direction to pinpoint the attention heads most responsible for safety behaviors. Further analysis shows that existing jailbreak attacks exploit this concentration by selectively bypassing or manipulating these critical attention heads. To address this issue, we propose AHD, a novel training strategy designed to promote the distributed encoding of safety-related behaviors across numerous attention heads. Experimental results demonstrate that AHD successfully distributes safety-related capabilities across more attention heads. Moreover, evaluations under several mainstream jailbreak attacks show that models trained with AHD exhibit considerably stronger safety robustness while maintaining overall functional utility.
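The refusal-direction-guided head ranking described above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the function names, the difference-of-means direction estimate, and the synthetic tensors are all assumptions; RDSHA's actual ablation procedure is specified in the paper.

```python
import torch

def refusal_direction(harmful_acts, harmless_acts):
    """Estimate a 'refusal direction' as the (unit-norm) difference of mean
    activations on harmful vs. harmless prompts -- a common approximation,
    assumed here for illustration. Shapes: (n_samples, d_model)."""
    d = harmful_acts.mean(0) - harmless_acts.mean(0)
    return d / d.norm()

def rank_heads_by_refusal_alignment(head_outputs, direction):
    """Score each attention head by how strongly its mean output projects
    onto the refusal direction; higher score = more safety-critical.
    head_outputs: (n_heads, n_samples, d_model) in the residual-stream basis."""
    scores = (head_outputs.mean(1) @ direction).abs()
    return torch.argsort(scores, descending=True)

# Synthetic illustration: head 0 is constructed to write along the direction.
torch.manual_seed(0)
d_model, n_heads, n = 16, 4, 32
harmful = torch.randn(n, d_model) + 3.0 * torch.eye(d_model)[0]
harmless = torch.randn(n, d_model)
direction = refusal_direction(harmful, harmless)

heads = 0.1 * torch.randn(n_heads, n, d_model)
heads[0] += direction  # only head 0 carries the refusal signal
ranking = rank_heads_by_refusal_alignment(heads, direction)
print(ranking.tolist())  # head 0 ranks first: safety is concentrated there
```

Ablating the top-ranked heads (e.g., zeroing their output projections) and measuring the drop in refusal rate is then the concentration test the paper's analysis performs.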


Key Contributions

  • RDSHA: a refusal-direction-guided ablation method to identify which attention heads are safety-critical, revealing that safety is concentrated in a dangerously small subset of heads
  • AHD (Attention Head-level Dropout): a training strategy that promotes distributed encoding of safety behaviors across many attention heads, reducing the attack surface for jailbreaks
  • Empirical analysis showing that existing jailbreak attacks exploit safety concentration by selectively suppressing the few critical attention heads
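A head-level dropout layer of the kind AHD's name suggests can be sketched as below. This is a hedged sketch under stated assumptions, not the paper's training recipe: the drop probability, the inverted-scaling convention, and where the layer sits in training are all assumptions; the key idea is that entire heads (not individual units) are zeroed, so no single head can monopolize safety behavior.

```python
import torch

def attention_head_dropout(head_outputs, p=0.2, training=True):
    """Drop whole attention heads at random during training, rescaling the
    survivors by 1/(1-p) so the expected output is unchanged (standard
    inverted-dropout convention, assumed here).
    head_outputs: (batch, n_heads, seq_len, d_head)."""
    if not training or p == 0.0:
        return head_outputs
    batch, n_heads = head_outputs.shape[:2]
    # One keep/drop decision per (sample, head), broadcast over seq and d_head.
    keep = (torch.rand(batch, n_heads, 1, 1,
                       device=head_outputs.device) >= p).float()
    return head_outputs * keep / (1.0 - p)

torch.manual_seed(0)
x = torch.ones(2, 8, 4, 16)          # batch=2, 8 heads, seq=4, d_head=16
y = attention_head_dropout(x, p=0.5)  # each head is all-zero or scaled by 2
```

Applied during safety fine-tuning, this forces the loss to be achievable even when any given subset of heads is silenced, pushing the refusal behavior to be encoded redundantly across many heads rather than a critical few.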

🛡️ Threat Analysis


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
white_box, inference_time, training_time
Applications
large language model safety alignment, jailbreak defense