defense 2026

Beyond Content Safety: Real-Time Monitoring for Reasoning Vulnerabilities in Large Language Models

Xunguang Wang 1, Yuguang Zhou 1, Qingyue Wang 1, Zongjie Li 1, Ruixuan Huang 1, Zhenlan Ji 1, Pingchuan Ma 2, Shuai Wang 1


Published on arXiv: 2603.25412

Prompt Injection

OWASP LLM Top 10 — LLM01

Model Denial of Service

OWASP LLM Top 10 — LLM04

Key Finding

Achieves 84.88% step-level localization accuracy and 85.37% error-type classification accuracy, outperforming hallucination detectors and process reward model baselines

Reasoning Safety Monitor

Novel technique introduced


Large language models (LLMs) increasingly rely on explicit chain-of-thought (CoT) reasoning to solve complex tasks, yet the safety of the reasoning process itself remains largely unaddressed. Existing work on LLM safety focuses on content safety (detecting harmful, biased, or factually incorrect outputs) and treats the reasoning chain as an opaque intermediate artifact. We identify reasoning safety as an orthogonal and equally critical security dimension: the requirement that a model's reasoning trajectory be logically consistent, computationally efficient, and resistant to adversarial manipulation. We make three contributions. First, we formally define reasoning safety and introduce a nine-category taxonomy of unsafe reasoning behaviors, covering input parsing errors, reasoning execution errors, and process management errors. Second, we conduct a large-scale prevalence study annotating 4111 reasoning chains from both natural reasoning benchmarks and four adversarial attack methods (reasoning hijacking and denial-of-service), confirming that all nine error types occur in practice and that each attack induces a mechanistically interpretable signature. Third, we propose a Reasoning Safety Monitor: an external LLM-based component that runs in parallel with the target model, inspects each reasoning step in real time via a taxonomy-embedded prompt, and dispatches an interrupt signal upon detecting unsafe behavior. Evaluation on a 450-chain static benchmark shows that our monitor achieves up to 84.88% step-level localization accuracy and 85.37% error-type classification accuracy, outperforming hallucination detectors and process reward model baselines by substantial margins. These results demonstrate that reasoning-level monitoring is both necessary and practically achievable, and establish reasoning safety as a foundational concern for the secure deployment of large reasoning models.
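The monitor-and-interrupt loop described in the abstract can be sketched as follows. This is an illustrative outline, not the authors' implementation: the paper realizes the per-step check with a judge LLM driven by a taxonomy-embedded prompt, which here is abstracted as an arbitrary `classify_step` callable, and the three category names are placeholders matching the taxonomy's top-level error classes.

```python
from dataclasses import dataclass

# Illustrative top-level error classes from the paper's taxonomy;
# the full nine subcategories are omitted here.
TAXONOMY = {
    "input_parsing_error": "step misreads or drops an input constraint",
    "reasoning_execution_error": "step contains a logical or arithmetic fault",
    "process_management_error": "step loops, stalls, or derails the chain",
}

@dataclass
class Verdict:
    step_index: int   # which reasoning step is unsafe (localization); -1 if none
    error_type: str   # taxonomy category (classification), or "safe"

def monitor_chain(steps, classify_step):
    """Inspect reasoning steps in order; return an interrupt Verdict at the
    first unsafe step, else a 'safe' Verdict for the whole chain.

    `classify_step(step, taxonomy)` stands in for the judge-LLM call; any
    callable returning a taxonomy key (unsafe) or None (safe) works here.
    """
    for i, step in enumerate(steps):
        label = classify_step(step, TAXONOMY)
        if label is not None:  # dispatch the interrupt signal
            return Verdict(step_index=i, error_type=label)
    return Verdict(step_index=-1, error_type="safe")

# Toy classifier for demonstration: flags a step asserting an arithmetic falsehood.
def toy_classifier(step, taxonomy):
    return "reasoning_execution_error" if "2 + 2 = 5" in step else None

chain = ["Parse: x = 2, y = 2.", "Compute: 2 + 2 = 5.", "Answer: 5."]
verdict = monitor_chain(chain, toy_classifier)
print(verdict.step_index, verdict.error_type)  # → 1 reasoning_execution_error
```

In a streaming deployment, `monitor_chain` would instead consume steps as the target model emits them, so the interrupt can fire before the chain completes.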


Key Contributions

  • Formal definition of reasoning safety with a nine-category taxonomy of unsafe reasoning behaviors (input parsing errors, reasoning execution errors, process management errors)
  • Large-scale prevalence study annotating 4111 reasoning chains from natural benchmarks and four adversarial attack methods, showing mechanistically interpretable attack signatures
  • Reasoning Safety Monitor: an external LLM-based component that inspects reasoning steps in real-time and detects unsafe behavior with 84.88% step-level localization accuracy and 85.37% error-type classification accuracy

Details

Domains
nlp
Model Types
llm
Threat Tags
inference_time; black_box
Datasets
450-chain static benchmark; 4111 annotated reasoning chains from natural reasoning benchmarks
Applications
chain-of-thought reasoning; llm reasoning safety