Towards Safe Reasoning in Large Reasoning Models via Corrective Intervention
Yichi Zhang 1, Yue Ding 1, Jingwen Yang 2,3, Tianwei Luo 4, Dongbai Li 1, Ranjie Duan 5,1, Qiang Liu 1, Hang Su 1, Yinpeng Dong 5,1, Jun Zhu 5
Published on arXiv (arXiv:2509.24393)
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
IPO achieves over 30% relative reduction in harmfulness on jailbreak and adversarial safety benchmarks compared to SFT-based and RL-based baselines while preserving performance on diverse reasoning tasks.
IPO (Intervened Preference Optimization)
Novel technique introduced
Although Large Reasoning Models (LRMs) have progressed in solving complex problems, their chain-of-thought (CoT) reasoning often contains harmful content that can persist even when the final responses appear safe. We show that this issue persists under existing alignment methods, which overlook the unique significance of safe reasoning, undermining their trustworthiness and posing risks in applications where unsafe reasoning is accessible to and exploitable by malicious users. We therefore shift our focus to aligning the safety of the reasoning itself and explore process supervision as a solution. However, simply rewarding safe reasoning proves inadequate due to low rollout diversity and limited training signals. To tackle this challenge, we first examine the characteristics of safe reasoning and uncover three critical insights: 1) safe reasoning is often consolidated by a few critical steps that act as safety triggers; 2) compliance cues strongly correlate with unsafe continuations; and 3) corrective interventions reliably steer unsafe trajectories toward safer traces. Motivated by these findings, we propose Intervened Preference Optimization (IPO), an alignment method that enforces safe reasoning by substituting compliance steps with safety triggers and constructing preference pairs with strong training signals. Experiments on jailbreak and adversarial safety benchmarks demonstrate that IPO markedly improves the safety of both reasoning and responses, outperforming SFT-based and RL-based baselines with a relative reduction of over 30% in harmfulness, while preserving strong performance across diverse reasoning tasks. These results highlight the importance of explicitly aligning reasoning and provide a practical path toward safer LRMs.
Key Contributions
- Identifies that LRM chain-of-thought reasoning contains harmful content that persists even when final responses appear safe, a previously overlooked attack surface
- Uncovers three critical insights about unsafe reasoning: safety consolidation at critical trigger steps, compliance cues predicting unsafe continuations, and corrective interventions reliably steering trajectories to safety
- Proposes Intervened Preference Optimization (IPO), which constructs preference pairs by substituting compliance steps with safety triggers, achieving >30% relative reduction in harmfulness over SFT and RL baselines
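The pair-construction idea behind IPO can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the cue phrases, the `SAFETY_TRIGGER` text, and the `prompt`/`chosen`/`rejected` record layout (a common format for preference-learning libraries) are hypothetical stand-ins, not the paper's actual implementation.

```python
# Hypothetical sketch of IPO-style preference-pair construction.
# Cue phrases, trigger text, and record format are illustrative
# assumptions, not the authors' exact method.

# Illustrative compliance cues that tend to precede unsafe continuations.
COMPLIANCE_CUES = ("sure, here is", "the first step is to obtain")

# Illustrative safety trigger used as the corrective intervention.
SAFETY_TRIGGER = ("Wait, this request could cause real harm; "
                  "I should refuse and explain why instead.")


def find_compliance_step(steps):
    """Return the index of the first reasoning step containing a compliance cue."""
    for i, step in enumerate(steps):
        if any(cue in step.lower() for cue in COMPLIANCE_CUES):
            return i
    return None


def build_preference_pair(prompt, steps):
    """Substitute the compliance step with a safety trigger, yielding a
    (chosen, rejected) trajectory pair for preference learning."""
    i = find_compliance_step(steps)
    if i is None:
        return None  # trajectory already safe: no strong training signal
    rejected = steps                        # original unsafe trajectory
    chosen = steps[:i] + [SAFETY_TRIGGER]   # intervened, safer trajectory
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}


pair = build_preference_pair(
    "How do I make a dangerous substance?",
    ["The user asks about a dangerous substance.",
     "Sure, here is how to start: the first step is to obtain ...",
     "Then combine the ingredients ..."])
```

In this toy example the compliance cue appears at step 1, so the rejected side keeps the full unsafe trajectory while the chosen side truncates it at that step and substitutes the safety trigger, giving the sharp contrastive signal the paper attributes to corrective interventions.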