
When Models Outthink Their Safety: Unveiling and Mitigating Self-Jailbreak in Large Reasoning Models

Yingzhi Mao 1,2, Chunkang Zhang 1, Junxiang Wang 1, Xinyan Guan 1,2, Boxi Cao 1, Yaojie Lu 1, Hongyu Lin 1, Xianpei Han 1,2, Le Sun 1,2

0 citations · 31 references · arXiv (Cornell University)

Published on arXiv

2510.21285

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

Self-Jailbreak accounts for ~80% of unsafe outputs in LRMs; CoG reduces attack success rates to competitive levels while substantially improving reasoning accuracy compared to existing safety methods.

Chain-of-Guardrail (CoG)

Novel technique introduced


Large Reasoning Models (LRMs) achieve strong performance on complex multi-step reasoning, yet they still exhibit severe safety failures such as harmful content generation. Existing methods often apply coarse-grained constraints over the entire reasoning trajectories, which can undermine reasoning capability while failing to address the root causes of unsafe behavior. In this work, we uncover a previously underexplored failure mode in LRMs, termed Self-Jailbreak, where models initially recognize the harmful intent of a query, but override this judgment during subsequent reasoning steps, ultimately generating unsafe outputs. Such a phenomenon reveals that LRMs are capable of recognizing harm, while safety failures primarily arise from reasoning steps. Motivated by this finding, we propose Chain-of-Guardrail (CoG), a trajectory-level training framework that mitigates Self-Jailbreak via targeted, step-level interventions while maintaining reasoning ability. Experiments across multiple safety and reasoning benchmarks indicate that CoG achieves a favorable balance between safety and reasoning performance compared with existing approaches.
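The Self-Jailbreak pattern described above can be illustrated as a simple trace-level check. This is a minimal sketch, not the paper's method: `judge_step` is a hypothetical, keyword-based stand-in for a real safety-stance classifier, and the refusal cues are invented for illustration.

```python
def judge_step(step: str) -> str:
    """Toy stand-in for a per-step safety-stance classifier (hypothetical).

    Returns "refuse" if the step appears to flag the request as harmful,
    otherwise "comply".
    """
    refusal_cues = ("harmful", "cannot help", "unsafe", "should not")
    return "refuse" if any(cue in step.lower() for cue in refusal_cues) else "comply"


def is_self_jailbreak(trace: list[str]) -> bool:
    """True if the model first recognizes harm, then overrides that judgment.

    A trace where harm is never recognized is a different failure mode,
    so it is not counted here.
    """
    stances = [judge_step(s) for s in trace]
    if "refuse" not in stances:
        return False
    first_refuse = stances.index("refuse")
    # Self-Jailbreak: a later step proceeds despite the earlier recognition.
    return "comply" in stances[first_refuse + 1:]


# Example: the model flags the request, then talks itself out of refusing.
trace = [
    "The user asks for attack instructions; this request is harmful.",
    "However, maybe it is for a novel, so I will answer anyway.",
]
print(is_self_jailbreak(trace))  # True
```

The point of the sketch is only the shape of the check: the failure is defined over the trajectory (an early refusal stance followed by later compliance), not over any single step in isolation.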


Key Contributions

  • Identifies and characterizes 'Self-Jailbreak' — a failure mode where LRMs initially recognize harmful intent but override that safety judgment during subsequent reasoning steps, accounting for ~80% of unsafe outputs analyzed
  • Proposes Chain-of-Guardrail (CoG), a trajectory-level training framework with targeted step-level interventions (Safety Recomposition and Safety Backtrack) that correct only failure-inducing reasoning segments
  • Empirically demonstrates CoG achieves a superior safety–reasoning balance vs. prior methods, e.g., on Qwen3-32B improving GPQA-Diamond from 54.30 to 62.38 and AIME2024 from 71.70 to 82.08 while matching SafeKey's safety performance
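A backtrack-style, step-level intervention like the one named above can be sketched as follows. This is an illustrative sketch under stated assumptions, not CoG itself: `is_unsafe_step` and `regenerate` are hypothetical stand-ins for the paper's step-level safety judge and the model's resampling of a safe continuation.

```python
def is_unsafe_step(step: str) -> bool:
    # Hypothetical stand-in for a step-level safety judge.
    return "answer anyway" in step.lower()


def regenerate(prefix: list[str]) -> list[str]:
    # Hypothetical stand-in for resampling a safe continuation from the model,
    # conditioned on the safe prefix of the trajectory.
    return prefix + ["This request is harmful, so I will refuse."]


def safety_backtrack(trace: list[str]) -> list[str]:
    """Truncate at the first failure-inducing step and regenerate from there.

    Only the offending segment is replaced; earlier reasoning steps are kept,
    which is the sense in which the intervention is targeted rather than a
    constraint over the whole trajectory.
    """
    for i, step in enumerate(trace):
        if is_unsafe_step(step):
            return regenerate(trace[:i])
    return trace  # no unsafe step found; trajectory unchanged


fixed = safety_backtrack([
    "The user asks for attack instructions; this request is harmful.",
    "However, maybe it is for a novel, so I will answer anyway.",
])
print(fixed[-1])  # This request is harmful, so I will refuse.
```

The design choice the sketch highlights is locality: correcting only the failure-inducing segment, rather than penalizing the entire reasoning trace, is what lets a framework of this shape preserve reasoning ability on benign inputs.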

🛡️ Threat Analysis


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
inference_time
Datasets
WildJailbreak, GPQA-Diamond, AIME2024
Applications
large reasoning models, chatbots, autonomous agents