
Published on arXiv

2510.06036

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

Cliff-as-a-Judge achieves comparable safety improvements using only 1.7% of the vanilla safety training data, and ablating just 3% of the identified refusal suppression heads reduces jailbreak attack success rates to below 10%.

Cliff-as-a-Judge

Novel technique introduced


Large reasoning models (LRMs) with multi-step reasoning capabilities have shown remarkable problem-solving abilities, yet they exhibit concerning safety vulnerabilities that remain poorly understood. In this work, we investigate why safety alignment fails in reasoning models through a mechanistic interpretability lens. Using a linear probing approach to trace refusal intentions across token positions, we discover a striking phenomenon termed the 'refusal cliff': many poorly-aligned reasoning models correctly identify harmful prompts and maintain strong refusal intentions during their thinking process, but experience a sharp drop in refusal scores at the final tokens before output generation. This suggests that these models are not inherently unsafe; rather, their refusal intentions are systematically suppressed. Through causal intervention analysis, we identify a sparse set of attention heads that negatively contribute to refusal behavior. Ablating just 3% of these heads can reduce attack success rates to below 10%. Building on these mechanistic insights, we propose 'Cliff-as-a-Judge', a novel data selection method that identifies training examples exhibiting the largest refusal cliff to efficiently repair reasoning models' safety alignment. This approach achieves comparable safety improvements using only 1.7% of the vanilla safety training data, demonstrating a less-is-more effect in safety alignment.
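The refusal-cliff measurement described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the probe weights, hidden states, and trace construction here are hypothetical stand-ins (in the paper, a linear probe is fit on a model's real activations and applied at each token position).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical linear probe (weight vector w, bias b) assumed to predict
# refusal intention from a hidden state; random stand-in for illustration.
hidden_dim = 16
w = rng.normal(size=hidden_dim)
b = 0.0

def refusal_score(hidden_state: np.ndarray) -> float:
    """Probe score in (0, 1): sigmoid of a linear read-out of the state."""
    logit = float(hidden_state @ w + b)
    return 1.0 / (1.0 + np.exp(-logit))

def refusal_cliff(hidden_states: np.ndarray) -> float:
    """Drop from the peak refusal score during the thinking process to the
    score at the final token before output generation (larger = sharper cliff)."""
    scores = np.array([refusal_score(h) for h in hidden_states])
    return float(scores[:-1].max() - scores[-1])

# Toy trace mimicking the phenomenon: strong refusal intention mid-reasoning,
# suppressed at the last token.
trace = rng.normal(size=(10, hidden_dim))
trace[4] = 3.0 * w / np.linalg.norm(w)    # high-refusal hidden state
trace[-1] = -3.0 * w / np.linalg.norm(w)  # suppressed final-token state
print(f"cliff = {refusal_cliff(trace):.3f}")
```

A near-zero cliff would indicate the refusal intention survives to generation; a value near 1 reproduces the sharp drop the paper reports in poorly-aligned reasoning models.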


Key Contributions

  • Discovery of the 'refusal cliff' phenomenon: reasoning LLMs correctly identify harmful prompts but systematically suppress refusal intentions at the final output tokens
  • Causal identification of sparse 'Refusal Suppression Heads' — ablating just 3% of these heads reduces attack success rates below 10%
  • Cliff-as-a-Judge: a data selection method exploiting refusal cliff magnitude to repair safety alignment using only 1.7% of standard safety training data
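The Cliff-as-a-Judge selection step can be sketched as a simple ranking: score each candidate safety-training example by its refusal-cliff magnitude and keep the top fraction. The scores below are random placeholders and the function names are assumptions; only the selection logic (rank by cliff, take the largest) follows the description above.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical per-token refusal-probe scores for a pool of safety-training
# examples: each row is one example (thinking tokens followed by final token).
n_examples, n_tokens = 1000, 12
scores = rng.uniform(size=(n_examples, n_tokens))

def cliff_magnitude(token_scores: np.ndarray) -> float:
    """Peak refusal score during reasoning minus the final-token score."""
    return float(token_scores[:-1].max() - token_scores[-1])

def select_by_cliff(scores: np.ndarray, fraction: float) -> np.ndarray:
    """Return indices of the examples with the largest refusal cliff,
    i.e. the cases where refusal intention is most strongly suppressed."""
    cliffs = np.array([cliff_magnitude(s) for s in scores])
    k = max(1, int(round(fraction * len(scores))))
    return np.argsort(cliffs)[::-1][:k]

# Keep 1.7% of the pool, mirroring the data budget reported in the paper.
chosen = select_by_cliff(scores, 0.017)
print(len(chosen))  # 17 of 1000 examples
```

Training only on the selected subset is what yields the less-is-more effect the paper reports: the examples with the sharpest cliffs are exactly those where the model's refusal intention needs repair.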

🛡️ Threat Analysis


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
inference_time, black_box
Applications
llm safety alignment, reasoning model safety, harmful prompt refusal