
When Models Outthink Their Safety: Unveiling and Mitigating Self-Jailbreak in Large Reasoning Models

Yingzhi Mao 1,2, Chunkang Zhang 1, Junxiang Wang 1, Xinyan Guan 1,2, Boxi Cao 1, Yaojie Lu 1, Hongyu Lin 1, Xianpei Han 1,2, Le Sun 1,2

0 citations · 31 references · arXiv (Cornell University)

Published on arXiv

2510.21285

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

Self-Jailbreak accounts for ~80% of unsafe outputs in LRMs; CoG reduces attack success rates to competitive levels while substantially improving reasoning accuracy compared to existing safety methods.

Chain-of-Guardrail (CoG)

Novel technique introduced


Large Reasoning Models (LRMs) achieve strong performance on complex multi-step reasoning, yet they still exhibit severe safety failures such as harmful content generation. Existing methods often apply coarse-grained constraints over the entire reasoning trajectories, which can undermine reasoning capability while failing to address the root causes of unsafe behavior. In this work, we uncover a previously underexplored failure mode in LRMs, termed Self-Jailbreak, where models initially recognize the harmful intent of a query, but override this judgment during subsequent reasoning steps, ultimately generating unsafe outputs. Such a phenomenon reveals that LRMs are capable of recognizing harm, while safety failures primarily arise from reasoning steps. Motivated by this finding, we propose Chain-of-Guardrail (CoG), a trajectory-level training framework that mitigates Self-Jailbreak via targeted, step-level interventions while maintaining reasoning ability. Experiments across multiple safety and reasoning benchmarks indicate that CoG achieves a favorable balance between safety and reasoning performance compared with existing approaches.
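The Self-Jailbreak pattern described above can be illustrated as a simple trace-level check. This is a minimal sketch, not the paper's method: `judge_step` is a hypothetical, keyword-based stand-in for a real safety-stance classifier, and the refusal cues are invented for illustration.

```python
def judge_step(step: str) -> str:
    """Toy stand-in for a per-step safety-stance classifier (hypothetical).

    Returns "refuse" if the step appears to flag the request as harmful,
    otherwise "comply".
    """
    refusal_cues = ("harmful", "cannot help", "unsafe", "should not")
    return "refuse" if any(cue in step.lower() for cue in refusal_cues) else "comply"


def is_self_jailbreak(trace: list[str]) -> bool:
    """True if the model first recognizes harm, then overrides that judgment.

    A trace where harm is never recognized is a different failure mode,
    so it is not counted here.
    """
    stances = [judge_step(s) for s in trace]
    if "refuse" not in stances:
        return False
    first_refuse = stances.index("refuse")
    # Self-Jailbreak: a later step proceeds despite the earlier recognition.
    return "comply" in stances[first_refuse + 1:]


# Example: the model flags the request, then talks itself out of refusing.
trace = [
    "The user asks for attack instructions; this request is harmful.",
    "However, maybe it is for a novel, so I will answer anyway.",
]
print(is_self_jailbreak(trace))  # True
```

The point of the sketch is only the shape of the check: the failure is defined over the trajectory (an early refusal stance followed by later compliance), not over any single step in isolation.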


Key Contributions

  • Identifies and characterizes 'Self-Jailbreak' — a failure mode where LRMs initially recognize harmful intent but override that safety judgment during subsequent reasoning steps, accounting for ~80% of unsafe outputs analyzed
  • Proposes Chain-of-Guardrail (CoG), a trajectory-level training framework with targeted step-level interventions (Safety Recomposition and Safety Backtrack) that correct only failure-inducing reasoning segments
  • Empirically demonstrates CoG achieves a superior safety–reasoning balance vs. prior methods, e.g., on Qwen3-32B improving GPQA-Diamond from 54.30 to 62.38 and AIME2024 from 71.70 to 82.08 while matching SafeKey's safety performance
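A backtrack-style, step-level intervention like the one named above can be sketched as follows. This is an illustrative sketch under stated assumptions, not CoG itself: `is_unsafe_step` and `regenerate` are hypothetical stand-ins for the paper's step-level safety judge and the model's resampling of a safe continuation.

```python
def is_unsafe_step(step: str) -> bool:
    # Hypothetical stand-in for a step-level safety judge.
    return "answer anyway" in step.lower()


def regenerate(prefix: list[str]) -> list[str]:
    # Hypothetical stand-in for resampling a safe continuation from the model,
    # conditioned on the safe prefix of the trajectory.
    return prefix + ["This request is harmful, so I will refuse."]


def safety_backtrack(trace: list[str]) -> list[str]:
    """Truncate at the first failure-inducing step and regenerate from there.

    Only the offending segment is replaced; earlier reasoning steps are kept,
    which is the sense in which the intervention is targeted rather than a
    constraint over the whole trajectory.
    """
    for i, step in enumerate(trace):
        if is_unsafe_step(step):
            return regenerate(trace[:i])
    return trace  # no unsafe step found; trajectory unchanged


fixed = safety_backtrack([
    "The user asks for attack instructions; this request is harmful.",
    "However, maybe it is for a novel, so I will answer anyway.",
])
print(fixed[-1])  # This request is harmful, so I will refuse.
```

The design choice the sketch highlights is locality: correcting only the failure-inducing segment, rather than penalizing the entire reasoning trace, is what lets a framework of this shape preserve reasoning ability on benign inputs.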

🛡️ Threat Analysis


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
inference_time
Datasets
WildJailbreak, GPQA-Diamond, AIME2024
Applications
large reasoning models, chatbots, autonomous agents