defense 2025

ReasoningGuard: Safeguarding Large Reasoning Models with Inference-time Safety Aha Moments

Yuquan Wang 1, Mi Zhang 1, Yining Wang 1, Geng Hong 1, Xiaoyu You 2, Min Yang 1

0 citations


Published on arXiv: 2508.04204

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

ReasoningGuard achieves state-of-the-art jailbreak defense against three attack types while outperforming seven existing safeguards, with minimal additional inference cost and reduced over-refusal.

ReasoningGuard

Novel technique introduced


Large Reasoning Models (LRMs) have demonstrated impressive performance on reasoning-intensive tasks, but they remain vulnerable to harmful content generation, particularly in the mid-to-late steps of their reasoning processes. Existing defense mechanisms, however, rely on costly fine-tuning and additional expert knowledge, which restricts their scalability. In this work, we propose ReasoningGuard, an inference-time safeguard for LRMs that injects timely safety aha moments to steer reasoning toward harmless yet helpful outputs. Leveraging the model's internal attention behavior, our approach accurately identifies critical points in the reasoning path and triggers spontaneous, safety-oriented reflection. To safeguard both the subsequent reasoning steps and the final answer, we further implement a scaling sampling strategy during the decoding phase, selecting the optimal reasoning path. Incurring minimal extra inference cost, ReasoningGuard effectively mitigates three types of jailbreak attacks, including the latest ones targeting the reasoning process of LRMs. Our approach outperforms seven existing safeguards, achieving state-of-the-art safety defense while avoiding common exaggerated-safety issues.
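The intervention described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the attention-threshold heuristic, and the injected phrase are all assumptions standing in for the paper's attention-based critical-point detection and safety aha-moment injection.

```python
# Hypothetical sketch of an inference-time safety intervention in the
# spirit of ReasoningGuard. All names and thresholds here are
# illustrative assumptions, not the paper's actual method or API.

SAFETY_AHA = "\nWait, let me check whether this step could enable harm.\n"


def is_critical_point(prompt_attention: list[float], threshold: float = 0.35) -> bool:
    """Flag a reasoning step whose mean attention back to the (potentially
    harmful) prompt tokens exceeds a threshold (toy stand-in for the
    paper's attention-based critical-point detection)."""
    return sum(prompt_attention) / len(prompt_attention) > threshold


def guard_step(reasoning_so_far: str, prompt_attention: list[float]) -> str:
    """Inject the safety-oriented reflection at a detected critical point,
    so subsequent decoding continues from a safety-aware context."""
    if is_critical_point(prompt_attention):
        return reasoning_so_far + SAFETY_AHA
    return reasoning_so_far


# A step attending heavily to the prompt triggers the injection;
# a benign step passes through unchanged.
guarded = guard_step("Step 3: combine the components...", [0.5, 0.4, 0.6])
untouched = guard_step("Step 1: restate the question.", [0.1, 0.1, 0.1])
```

In a real decoding loop, `prompt_attention` would come from the model's attention weights over prompt tokens at each generated step, and the injected text would be appended to the context before generation resumes.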


Key Contributions

  • Inference-time safeguard that uses internal attention behavior to identify critical jailbreak-vulnerable points mid-reasoning in LRMs and injects safety-oriented 'aha moment' reflections
  • Scaling sampling strategy during decoding that selects the optimal (safest and most helpful) reasoning path among candidates
  • Outperforms seven existing safeguards on three categories of jailbreak attacks — including attacks targeting the reasoning process specifically — while minimizing exaggerated safety refusals
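The scaling-sampling selection in the second bullet can be sketched as picking, among several sampled continuations, the one maximizing a combined safety/helpfulness score. The weighting scheme and toy scorers below are illustrative assumptions, not the paper's actual scoring functions.

```python
# Minimal sketch of selecting the "optimal" reasoning path among sampled
# candidates. The alpha weight and the scorers are assumptions for
# illustration; the paper's scoring criteria may differ.

def select_best_path(candidates, safety_fn, helpful_fn, alpha=0.7):
    """Return the candidate path with the highest weighted safety +
    helpfulness score (higher alpha prioritizes safety)."""
    return max(candidates,
               key=lambda c: alpha * safety_fn(c) + (1 - alpha) * helpful_fn(c))


# Toy scorers: penalize an unsafe keyword, reward (capped) length as a
# crude helpfulness proxy.
safety = lambda c: 0.0 if "bypass" in c else 1.0
helpful = lambda c: min(len(c) / 50.0, 1.0)

paths = [
    "Here is how to bypass the filter...",
    "I can't help with that, but here is safe guidance on the topic.",
]
best = select_best_path(paths, safety, helpful)
```

In practice the candidates would be multiple decoded continuations sampled after the injected safety reflection, so both the remaining reasoning steps and the final answer are selected for safety.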

🛡️ Threat Analysis


Details

Domains
nlp
Model Types
llm
Threat Tags
inference_time, white_box
Datasets
AdvBench
Applications
large reasoning models, llm safety, jailbreak defense