Refusal Falls off a Cliff: How Safety Alignment Fails in Reasoning?
Qingyu Yin 1,2, Chak Tou Leong 3, Linyi Yang 4, Wenxuan Huang 3, Wenjie Li 5, Xiting Wang 6, Jaehong Yoon 7, Yun Xing 2, Xing Yu 2, Jinjin Gu 8
3 Hong Kong Polytechnic University
4 East China Normal University
5 Southern University of Science and Technology
7 Nanyang Technological University
8 INSAIT
Published on arXiv
2510.06036
Prompt Injection
OWASP LLM Top 10: LLM01
Key Finding
Cliff-as-a-Judge achieves comparable safety improvements using only 1.7% of vanilla safety training data, and ablating 3% of identified refusal suppression heads reduces jailbreak attack success rates below 10%.
Cliff-as-a-Judge
Novel technique introduced
Large reasoning models (LRMs) with multi-step reasoning capabilities have shown remarkable problem-solving abilities, yet they exhibit concerning safety vulnerabilities that remain poorly understood. In this work, we investigate why safety alignment fails in reasoning models through a mechanistic interpretability lens. Using a linear probing approach to trace refusal intentions across token positions, we discover a striking phenomenon we term the **refusal cliff**: many poorly-aligned reasoning models correctly identify harmful prompts and maintain strong refusal intentions during their thinking process, but experience a sharp drop in refusal scores at the final tokens before output generation. This suggests that these models are not inherently unsafe; rather, their refusal intentions are systematically suppressed. Through causal intervention analysis, we identify a sparse set of attention heads that contribute negatively to refusal behavior. Ablating just 3% of these heads reduces attack success rates to below 10%. Building on these mechanistic insights, we propose **Cliff-as-a-Judge**, a novel data selection method that identifies the training examples exhibiting the largest refusal cliff and uses them to efficiently repair reasoning models' safety alignment. This approach achieves comparable safety improvements using only 1.7% of the vanilla safety training data, demonstrating a less-is-more effect in safety alignment.
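The probing setup described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: `probe_w` stands in for a pre-trained linear refusal probe (training not shown), the sigmoid scoring is a standard choice for such probes, and the 8-token tail window for locating the pre-output drop is an assumed parameter.

```python
import numpy as np

def refusal_scores(hidden_states: np.ndarray,
                   probe_w: np.ndarray,
                   probe_b: float = 0.0) -> np.ndarray:
    """Score each token position with a linear refusal probe.

    hidden_states: (seq_len, d_model) activations from one layer.
    probe_w:       (d_model,) hypothetical probe weights trained to
                   separate refusal from compliance states.
    Returns per-token refusal scores in (0, 1) via a sigmoid.
    """
    logits = hidden_states @ probe_w + probe_b
    return 1.0 / (1.0 + np.exp(-logits))

def cliff_magnitude(scores: np.ndarray, tail: int = 8) -> float:
    """Refusal cliff: drop from the peak score during the thinking
    phase to the minimum over the last `tail` tokens before output."""
    return float(scores[:-tail].max() - scores[-tail:].min())

# Synthetic demo: strong refusal signal through the thinking tokens,
# then a sharp suppression just before output generation.
probe_w = np.array([5.0, 0.0, 0.0, 0.0])  # toy probe along one dimension
hs = np.zeros((28, 4))
hs[:20, 0] = 1.0   # thinking tokens: refusal direction active
hs[20:, 0] = -1.0  # final tokens: refusal direction suppressed
scores = refusal_scores(hs, probe_w)
print(f"cliff = {cliff_magnitude(scores):.3f}")  # large drop => refusal cliff
```

On real models the hidden states would come from a forward pass of the LRM on a harmful prompt, with scores traced across the chain-of-thought tokens.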
Key Contributions
- Discovery of the 'refusal cliff' phenomenon: reasoning LLMs correctly identify harmful prompts but systematically suppress refusal intentions at the final output tokens
- Causal identification of sparse 'Refusal Suppression Heads': ablating just 3% of these heads reduces attack success rates to below 10%
- Cliff-as-a-Judge: a data selection method exploiting refusal cliff magnitude to repair safety alignment using only 1.7% of standard safety training data
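The selection step of Cliff-as-a-Judge can be sketched as below, assuming per-token refusal-probe scores have already been computed for each candidate training example. The 8-token tail window and the rank-by-cliff-then-truncate heuristic are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def cliff_as_a_judge(example_scores: list, budget: float = 0.017,
                     tail: int = 8) -> list:
    """Select the fraction `budget` of examples with the largest refusal cliff.

    example_scores[i]: 1-D array of per-token refusal-probe scores for
                       example i (thinking tokens followed by `tail`
                       pre-output tokens).
    Returns indices of the selected examples, largest cliff first.
    """
    cliffs = [float(s[:-tail].max() - s[-tail:].min()) for s in example_scores]
    k = max(1, int(round(budget * len(example_scores))))  # keep >= 1 example
    order = sorted(range(len(cliffs)), key=lambda i: cliffs[i], reverse=True)
    return order[:k]

# Toy pool of three examples: only the second shows a sharp pre-output drop.
flat_safe = np.full(16, 0.9)                         # no cliff
cliffed = np.concatenate([np.full(8, 0.95), np.full(8, 0.05)])  # big cliff
flat_mid = np.full(16, 0.5)                          # no cliff
print(cliff_as_a_judge([flat_safe, cliffed, flat_mid]))  # -> [1]
```

The default `budget=0.017` mirrors the 1.7% data fraction reported in the paper; the examples with the steepest cliffs are exactly those where refusal intent exists but is suppressed, making them the highest-value repair targets.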