Safety Recovery in Reasoning Models Is Only a Few Early Steering Steps Away
Soumya Suvra Ghosal 1, Souradip Chakraborty 1, Vaibhav Singh 2, Furong Huang 1, Dinesh Manocha 1, Amrit Singh Bedi 3
Published on arXiv
2602.11096
Input Manipulation Attack
OWASP ML Top 10 — ML01
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
Reduces attack success rates by 30–60% across six MLRMs (e.g., LlamaV-o1: 63.33% → 5.74% on JailbreakV-28K) while preserving reasoning performance (MathVista accuracy: 65.20% → 65.00%).
SafeThink
Novel technique introduced
Reinforcement learning (RL) based post-training for explicit chain-of-thought (e.g., GRPO) improves the reasoning ability of multimodal large reasoning models (MLRMs). But recent evidence shows that it can simultaneously degrade safety alignment and increase jailbreak success rates. We propose SafeThink, a lightweight inference-time defense that treats safety recovery as a satisficing constraint rather than a maximization objective. SafeThink monitors the evolving reasoning trace with a safety reward model and conditionally injects an optimized short corrective prefix ("Wait, think safely") only when the safety threshold is violated. In our evaluations across six open-source MLRMs and four jailbreak benchmarks (JailbreakV-28K, Hades, FigStep, and MM-SafetyBench), SafeThink reduces attack success rates by 30–60% (e.g., LlamaV-o1: 63.33% to 5.74% on JailbreakV-28K; R1-Onevision: 69.07% to 5.65% on Hades) while preserving reasoning performance (MathVista accuracy: 65.20% to 65.00%). A key empirical finding from our experiments is that safety recovery is often only a few steering steps away: intervening in the first 1–3 reasoning steps typically suffices to redirect the full generation toward safe completions.
Key Contributions
- SafeThink: a lightweight inference-time defense that monitors evolving reasoning traces with a safety reward model and conditionally injects a short corrective prefix ("Wait, think safely") only when the safety threshold is violated
- Satisficing principle for safety recovery — enforces a threshold constraint on safety reward rather than maximizing it, preserving reasoning quality
- Empirical finding that intervening in the first 1–3 reasoning steps is sufficient to redirect the full generation toward safe completions across six open-source MLRMs
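The monitor-and-steer loop described above can be sketched as follows. This is an illustrative approximation, not the authors' implementation: the function names (`generate_step`, `safety_score`), the threshold value, and the early-step cutoff are assumptions chosen for illustration; only the corrective prefix and the 1–3-step finding come from the paper.

```python
# Sketch of a SafeThink-style satisficing monitor. Hypothetical interfaces:
#   generate_step(prompt, trace) -> next reasoning step (string)
#   safety_score(trace)          -> scalar safety reward for the trace
CORRECTIVE_PREFIX = "Wait, think safely"
SAFETY_THRESHOLD = 0.5   # assumed threshold on the safety reward
MAX_EARLY_STEPS = 3      # paper reports steps 1-3 usually suffice

def safethink_generate(generate_step, safety_score, prompt, num_steps=10):
    """Generate reasoning steps; when the safety reward of the evolving
    trace violates the threshold during the early steps, inject the
    corrective prefix and regenerate. Satisficing: we only enforce the
    constraint, we never keep steering to maximize the safety reward."""
    trace = []
    for step_idx in range(num_steps):
        step = generate_step(prompt, trace)
        if (step_idx < MAX_EARLY_STEPS
                and safety_score(trace + [step]) < SAFETY_THRESHOLD):
            # Threshold violated early: prefix the corrective phrase
            # and regenerate this step conditioned on it.
            steered = generate_step(prompt, trace + [CORRECTIVE_PREFIX])
            trace.extend([CORRECTIVE_PREFIX, steered])
        else:
            trace.append(step)
    return trace
```

Because the check runs only on the first few steps and stops as soon as the constraint is satisfied, the overhead is a handful of extra reward-model calls per generation, which is what makes the defense lightweight at inference time.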
🛡️ Threat Analysis
The paper defends against adversarial visual inputs to VLMs that manipulate model outputs, such as typographic attacks in FigStep and adversarially generated Stable Diffusion images in MM-SafetyBench. Under the dual multimodal tagging rule, adversarial visual inputs designed to jailbreak VLMs qualify as ML01 (Input Manipulation Attack).