defense 2026

Beyond Content Safety: Real-Time Monitoring for Reasoning Vulnerabilities in Large Language Models

Xunguang Wang 1, Yuguang Zhou 1, Qingyue Wang 1, Zongjie Li 1, Ruixuan Huang 1, Zhenlan Ji 1, Pingchuan Ma 2, Shuai Wang 1


Published on arXiv: 2603.25412

Prompt Injection

OWASP LLM Top 10 — LLM01

Model Denial of Service

OWASP LLM Top 10 — LLM04

Key Finding

Achieves 84.88% step-level localization accuracy and 85.37% error-type classification accuracy, outperforming hallucination detectors and process reward model baselines

Reasoning Safety Monitor

Novel technique introduced


Large language models (LLMs) increasingly rely on explicit chain-of-thought (CoT) reasoning to solve complex tasks, yet the safety of the reasoning process itself remains largely unaddressed. Existing work on LLM safety focuses on content safety (detecting harmful, biased, or factually incorrect outputs) and treats the reasoning chain as an opaque intermediate artifact. We identify reasoning safety as an orthogonal and equally critical security dimension: the requirement that a model's reasoning trajectory be logically consistent, computationally efficient, and resistant to adversarial manipulation. We make three contributions. First, we formally define reasoning safety and introduce a nine-category taxonomy of unsafe reasoning behaviors, covering input parsing errors, reasoning execution errors, and process management errors. Second, we conduct a large-scale prevalence study annotating 4111 reasoning chains from both natural reasoning benchmarks and four adversarial attack methods (reasoning hijacking and denial-of-service), confirming that all nine error types occur in practice and that each attack induces a mechanistically interpretable signature. Third, we propose a Reasoning Safety Monitor: an external LLM-based component that runs in parallel with the target model, inspects each reasoning step in real time via a taxonomy-embedded prompt, and dispatches an interrupt signal upon detecting unsafe behavior. Evaluation on a 450-chain static benchmark shows that our monitor achieves up to 84.88% step-level localization accuracy and 85.37% error-type classification accuracy, outperforming hallucination detectors and process reward model baselines by substantial margins. These results demonstrate that reasoning-level monitoring is both necessary and practically achievable, and establish reasoning safety as a foundational concern for the secure deployment of large reasoning models.
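The monitor-and-interrupt loop described in the abstract can be sketched as follows. This is an illustrative outline, not the authors' implementation: the paper realizes the per-step check with a judge LLM driven by a taxonomy-embedded prompt, which here is abstracted as an arbitrary `classify_step` callable, and the three category names are placeholders matching the taxonomy's top-level error classes.

```python
from dataclasses import dataclass

# Illustrative top-level error classes from the paper's taxonomy;
# the full nine subcategories are omitted here.
TAXONOMY = {
    "input_parsing_error": "step misreads or drops an input constraint",
    "reasoning_execution_error": "step contains a logical or arithmetic fault",
    "process_management_error": "step loops, stalls, or derails the chain",
}

@dataclass
class Verdict:
    step_index: int   # which reasoning step is unsafe (localization); -1 if none
    error_type: str   # taxonomy category (classification), or "safe"

def monitor_chain(steps, classify_step):
    """Inspect reasoning steps in order; return an interrupt Verdict at the
    first unsafe step, else a 'safe' Verdict for the whole chain.

    `classify_step(step, taxonomy)` stands in for the judge-LLM call; any
    callable returning a taxonomy key (unsafe) or None (safe) works here.
    """
    for i, step in enumerate(steps):
        label = classify_step(step, TAXONOMY)
        if label is not None:  # dispatch the interrupt signal
            return Verdict(step_index=i, error_type=label)
    return Verdict(step_index=-1, error_type="safe")

# Toy classifier for demonstration: flags a step asserting an arithmetic falsehood.
def toy_classifier(step, taxonomy):
    return "reasoning_execution_error" if "2 + 2 = 5" in step else None

chain = ["Parse: x = 2, y = 2.", "Compute: 2 + 2 = 5.", "Answer: 5."]
verdict = monitor_chain(chain, toy_classifier)
print(verdict.step_index, verdict.error_type)  # → 1 reasoning_execution_error
```

In a streaming deployment, `monitor_chain` would instead consume steps as the target model emits them, so the interrupt can fire before the chain completes.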


Key Contributions

  • Formal definition of reasoning safety with a nine-category taxonomy of unsafe reasoning behaviors (input parsing errors, reasoning execution errors, process management errors)
  • Large-scale prevalence study annotating 4111 reasoning chains from natural benchmarks and four adversarial attack methods, showing mechanistically interpretable attack signatures
  • Reasoning Safety Monitor: an external LLM-based component that inspects reasoning steps in real-time and detects unsafe behavior with 84.88% step-level localization accuracy and 85.37% error-type classification accuracy

Details

Domains
nlp
Model Types
llm
Threat Tags
inference_time; black_box
Datasets
450-chain static benchmark; 4111 annotated reasoning chains from natural reasoning benchmarks
Applications
chain-of-thought reasoning; llm reasoning safety