TraceGuard: Process-Guided Firewall against Reasoning Backdoors in Large Language Models
Zhen Guo 1, Shanghao Shi 2, Hao Li 1, Shamim Yazdani 2, Ning Zhang 2, Reza Tourani 1
Published on arXiv (2603.02436)
Model Poisoning
OWASP ML Top 10 — ML10
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
A 4B-parameter TraceGuard verifier achieves forensic precision on unseen reasoning backdoor attacks — including latent backdoors and post-hoc rationalizations — that rivals architectures two orders of magnitude larger, while maintaining robustness against grey-box adaptive adversaries.
TraceGuard
Novel technique introduced
The deployment of Large Reasoning Models (LRMs) in high-stakes decision-making pipelines has introduced a novel and opaque attack surface: reasoning backdoors. In these attacks, the model's intermediate Chain-of-Thought (CoT) is manipulated to provide a linguistically plausible but logically fallacious justification for a malicious conclusion. While frontier models exhibit an intrinsic capacity to detect these fractures, compact, deployable models suffer from a fundamental verification gap, relying on fragile lexical heuristics that are easily bypassed by motivated adversaries. To bridge this gap, we propose TraceGuard, a process-guided security framework that transforms small-scale models into robust reasoning firewalls. Our approach treats the reasoning trace as an untrusted payload and establishes a defense-in-depth strategy through three synergistic phases: (1) Automated Forensic Synthesis, which generates contrastive reasoning pairs to isolate the specific logical point of fracture; (2) Step-Aware Supervised Fine-Tuning (SSFT), to instill a structural verification grammar; and (3) Verifier-Guided Reinforcement Learning (VGRL), utilizing Group Relative Policy Optimization. We identify and mitigate a critical failure mode of baseline alignment, lexical overfitting, whereby verifiers memorize adversarial triggers rather than auditing logical integrity. Our empirical evaluation demonstrates that TraceGuard acts as a security force multiplier: a 4B-parameter verifier achieves forensic precision on unseen attacks (including latent backdoors and post-hoc rationalizations) that rivals architectures two orders of magnitude larger. We further demonstrate robustness against adaptive adversaries in a grey-box setting, establishing TraceGuard as a viable, low-latency security primitive for the Trusted Computing Base.
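The VGRL phase relies on Group Relative Policy Optimization, whose defining step is normalizing each sampled output's reward against the other samples in its group rather than against a learned critic. The sketch below illustrates that group-relative advantage computation in isolation; the function name and the binary "correct verdict" reward are assumptions for illustration, not the paper's actual reward design.

```python
# Minimal sketch of GRPO's group-relative advantage step, assuming a
# simple binary reward per sampled verifier verdict (illustrative only).
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own sample group.

    GRPO scores a group of G rollouts for the same prompt and uses
    (r_i - mean) / std as the advantage, avoiding a separate value model.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]

# Example: 4 sampled verdicts for one poisoned trace; reward 1.0 when the
# verdict flags the correct fracture step, 0.0 otherwise (hypothetical).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
# Correct verdicts get positive advantage, incorrect ones negative;
# the advantages sum to (approximately) zero within the group.
```

Because the normalization is per group, the same absolute reward can yield a positive or negative learning signal depending on how the verifier's other samples fared, which is what pushes the policy toward consistently correct audits rather than occasionally lucky ones.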
Key Contributions
- Automated Forensic Synthesis pipeline that generates contrastive reasoning pairs to isolate the specific logical 'point of fracture' in manipulated CoT traces
- Step-Aware Supervised Fine-Tuning (SSFT) + Verifier-Guided Reinforcement Learning (VGRL) using GRPO to train compact verifiers that audit logical structure rather than memorizing adversarial trigger patterns
- Identification and mitigation of 'lexical overfitting' — a critical failure mode where baseline verifiers learn surface-level trigger patterns instead of genuine logical integrity checks
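The forensic synthesis contribution centers on contrastive pairs that differ only at the manipulated step, so a verifier can be supervised on the exact point of fracture. A minimal data shape for such a pair might look like the following; the class and field names are assumptions for illustration, not the paper's schema.

```python
# Illustrative record for one contrastive reasoning pair, as an automated
# forensic synthesis stage might emit it (field names are hypothetical).
from dataclasses import dataclass, field

@dataclass(frozen=True)
class ContrastivePair:
    question: str
    clean_steps: list = field(default_factory=list)     # benign CoT, one entry per step
    poisoned_steps: list = field(default_factory=list)  # same CoT with one manipulated step
    fracture_index: int = -1                            # first logically broken step

    def fracture_step(self):
        """Return the manipulated step the verifier should flag."""
        return self.poisoned_steps[self.fracture_index]

pair = ContrastivePair(
    question="Is the loan applicant eligible?",
    clean_steps=["Income exceeds threshold.",
                 "No defaults on record.",
                 "Therefore eligible."],
    poisoned_steps=["Income exceeds threshold.",
                    "No defaults on record.",
                    "Records are incomplete, so ineligible."],
    fracture_index=2,
)
```

Keeping every step before the fracture identical across the two traces is what lets step-aware training attribute the label to logical structure at one step, rather than to surface wording anywhere in the trace, which is the lexical-overfitting failure mode the paper targets.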
🛡️ Threat Analysis
The paper's primary threat model is 'reasoning backdoors' — training-time backdoors embedded in LRMs that activate hidden malicious behavior through manipulated Chain-of-Thought traces. TraceGuard is a defense that detects latent backdoors and post-hoc CoT rationalizations, directly addressing the backdoor/trojan threat class.