Safety Recovery in Reasoning Models Is Only a Few Early Steering Steps Away
Soumya Suvra Ghosal 1, Souradip Chakraborty 1, Vaibhav Singh 2, Furong Huang 1, Dinesh Manocha 1, Amrit Singh Bedi 3
Published on arXiv
2602.11096
Input Manipulation Attack
OWASP ML Top 10 — ML01
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
Reduces attack success rates by 30–60% across six MLRMs (e.g., LlamaV-o1: 63.33% → 5.74% on JailbreakV-28K) while preserving reasoning performance (MathVista accuracy: 65.20% → 65.00%).
SafeThink
Novel technique introduced
Reinforcement learning (RL) based post-training for explicit chain-of-thought (e.g., GRPO) improves the reasoning ability of multimodal large reasoning models (MLRMs). But recent evidence shows that it can simultaneously degrade safety alignment and increase jailbreak success rates. We propose SafeThink, a lightweight inference-time defense that treats safety recovery as a satisficing constraint rather than a maximization objective. SafeThink monitors the evolving reasoning trace with a safety reward model and conditionally injects an optimized short corrective prefix ("Wait, think safely") only when the safety threshold is violated. In our evaluations across six open-source MLRMs and four jailbreak benchmarks (JailbreakV-28K, Hades, FigStep, and MM-SafetyBench), SafeThink reduces attack success rates by 30–60% (e.g., LlamaV-o1: 63.33% to 5.74% on JailbreakV-28K; R1-Onevision: 69.07% to 5.65% on Hades) while preserving reasoning performance (MathVista accuracy: 65.20% to 65.00%). A key empirical finding from our experiments is that safety recovery is often only a few steering steps away: intervening in the first 1–3 reasoning steps typically suffices to redirect the full generation toward safe completions.
Key Contributions
- SafeThink: a lightweight inference-time defense that monitors evolving reasoning traces with a safety reward model and conditionally injects a short corrective prefix ("Wait, think safely") only when the safety threshold is violated
- Satisficing principle for safety recovery — enforces a threshold constraint on safety reward rather than maximizing it, preserving reasoning quality
- Empirical finding that intervening in the first 1–3 reasoning steps is sufficient to redirect the full generation toward safe completions across six open-source MLRMs
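The monitor-and-steer loop described above can be sketched as follows. This is an illustrative approximation, not the authors' implementation: the function names (`generate_step`, `safety_score`), the threshold value, and the early-step cutoff are assumptions chosen for illustration; only the corrective prefix and the 1–3-step finding come from the paper.

```python
# Sketch of a SafeThink-style satisficing monitor. Hypothetical interfaces:
#   generate_step(prompt, trace) -> next reasoning step (string)
#   safety_score(trace)          -> scalar safety reward for the trace
CORRECTIVE_PREFIX = "Wait, think safely"
SAFETY_THRESHOLD = 0.5   # assumed threshold on the safety reward
MAX_EARLY_STEPS = 3      # paper reports steps 1-3 usually suffice

def safethink_generate(generate_step, safety_score, prompt, num_steps=10):
    """Generate reasoning steps; when the safety reward of the evolving
    trace violates the threshold during the early steps, inject the
    corrective prefix and regenerate. Satisficing: we only enforce the
    constraint, we never keep steering to maximize the safety reward."""
    trace = []
    for step_idx in range(num_steps):
        step = generate_step(prompt, trace)
        if (step_idx < MAX_EARLY_STEPS
                and safety_score(trace + [step]) < SAFETY_THRESHOLD):
            # Threshold violated early: prefix the corrective phrase
            # and regenerate this step conditioned on it.
            steered = generate_step(prompt, trace + [CORRECTIVE_PREFIX])
            trace.extend([CORRECTIVE_PREFIX, steered])
        else:
            trace.append(step)
    return trace
```

Because the check runs only on the first few steps and stops as soon as the constraint is satisfied, the overhead is a handful of extra reward-model calls per generation, which is what makes the defense lightweight at inference time.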
🛡️ Threat Analysis
The paper defends against adversarial visual inputs to VLMs that manipulate model outputs, such as typographic attacks in FigStep and adversarially generated Stable Diffusion images in MM-SafetyBench. Under the dual multimodal tagging rule, adversarial visual inputs designed to jailbreak VLMs qualify as ML01 (Input Manipulation Attack).