
Safety Recovery in Reasoning Models Is Only a Few Early Steering Steps Away

Soumya Suvra Ghosal 1, Souradip Chakraborty 1, Vaibhav Singh 2, Furong Huang 1, Dinesh Manocha 1, Amrit Singh Bedi 3

0 citations · 72 references · arXiv (Cornell University)


Published on arXiv

2602.11096

Input Manipulation Attack

OWASP ML Top 10 — ML01

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

Reduces attack success rates by 30–60% across six MLRMs (e.g., LlamaV-o1: 63.33% → 5.74% on JailbreakV-28K) while preserving reasoning performance (MathVista accuracy: 65.20% → 65.00%).

SafeThink

Novel technique introduced


Reinforcement learning (RL) based post-training for explicit chain-of-thought (e.g., GRPO) improves the reasoning ability of multimodal large reasoning models (MLRMs), but recent evidence shows that it can simultaneously degrade safety alignment and increase jailbreak success rates. We propose SafeThink, a lightweight inference-time defense that treats safety recovery as a satisficing constraint rather than a maximization objective. SafeThink monitors the evolving reasoning trace with a safety reward model and conditionally injects an optimized short corrective prefix ("Wait, think safely") only when the safety threshold is violated. In evaluations across six open-source MLRMs and four jailbreak benchmarks (JailbreakV-28K, Hades, FigStep, and MM-SafetyBench), SafeThink reduces attack success rates by 30–60% (e.g., LlamaV-o1: 63.33% → 5.74% on JailbreakV-28K; R1-Onevision: 69.07% → 5.65% on Hades) while preserving reasoning performance (MathVista accuracy: 65.20% → 65.00%). A key empirical finding is that safety recovery is often only a few steering steps away: intervening in the first 1–3 reasoning steps typically suffices to redirect the full generation toward a safe completion.


Key Contributions

  • SafeThink: a lightweight inference-time defense that monitors the evolving reasoning trace with a safety reward model and conditionally injects a short corrective prefix ("Wait, think safely") only when the safety threshold is violated
  • Satisficing principle for safety recovery — enforces a threshold constraint on safety reward rather than maximizing it, preserving reasoning quality
  • Empirical finding that intervening in the first 1–3 reasoning steps is sufficient to redirect the full generation toward safe completions across six open-source MLRMs
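The conditional steering loop described above can be sketched as follows. This is a minimal illustration under assumed interfaces, not the authors' implementation: the step generator, the safety scorer, and the threshold value are all hypothetical stand-ins for the paper's MLRM, safety reward model, and tuned satisficing bound.

```python
SAFETY_THRESHOLD = 0.5              # satisficing bound (assumed value, not from the paper)
CORRECTIVE_PREFIX = "Wait, think safely."

def safethink_generate(next_step, safety_score, n_steps=5):
    """Build a reasoning trace step by step. When the safety reward of the
    partial trace falls below the threshold, prepend the corrective prefix to
    the offending step (a satisficing check, not per-step reward maximization)."""
    trace = []
    for i in range(n_steps):
        step = next_step(i, trace)
        if safety_score(trace + [step]) < SAFETY_THRESHOLD:
            # Constraint violated: steer this step, then resume normal decoding.
            step = f"{CORRECTIVE_PREFIX} {step}"
        trace.append(step)
    return trace

# Toy stand-ins for demonstration: step 1 is flagged unsafe until steered.
def toy_next_step(i, trace):
    return f"step-{i}"

def toy_safety_score(trace):
    return 0.0 if trace and trace[-1] == "step-1" else 1.0

trace = safethink_generate(toy_next_step, toy_safety_score)
# Only the flagged step receives the corrective prefix; the rest are untouched.
```

Because the check is a threshold constraint, safe traces pass through unmodified, which is consistent with the paper's observation that reasoning performance is preserved.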

🛡️ Threat Analysis

Input Manipulation Attack

The paper defends against adversarial visual inputs to VLMs, such as the typographic attacks in FigStep and the adversarially generated (Stable Diffusion) images in MM-SafetyBench, that manipulate model outputs. Under the dual multimodal tagging rule, adversarial visual inputs designed to jailbreak VLMs qualify as ML01 (Input Manipulation Attack).


Details

Domains
multimodal, nlp
Model Types
vlm, llm
Threat Tags
inference_time
Datasets
JailbreakV-28K, Hades, FigStep, MM-SafetyBench, MathVista
Applications
multimodal reasoning models, vision-language models, chain-of-thought reasoning safety