TraceGuard: Process-Guided Firewall against Reasoning Backdoors in Large Language Models
Zhen Guo 1, Shanghao Shi 2, Hao Li 1, Shamim Yazdani 2, Ning Zhang 2, Reza Tourani 1
Published on arXiv (2603.02436)
Model Poisoning
OWASP ML Top 10 — ML10
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
A 4B-parameter TraceGuard verifier achieves forensic precision on unseen reasoning backdoor attacks — including latent backdoors and post-hoc rationalizations — that rivals architectures two orders of magnitude larger, while maintaining robustness against grey-box adaptive adversaries.
TraceGuard
Novel technique introduced
The deployment of Large Reasoning Models (LRMs) in high-stakes decision-making pipelines has introduced a novel and opaque attack surface: reasoning backdoors. In these attacks, the model's intermediate Chain-of-Thought (CoT) is manipulated to provide a linguistically plausible but logically fallacious justification for a malicious conclusion. While frontier models exhibit an intrinsic capacity to detect these fractures, compact, deployable models suffer from a fundamental verification gap, relying on fragile lexical heuristics that are easily bypassed by motivated adversaries. To bridge this gap, we propose TraceGuard, a process-guided security framework that transforms small-scale models into robust reasoning firewalls. Our approach treats the reasoning trace as an untrusted payload and establishes a defense-in-depth strategy through three synergistic phases: (1) Automated Forensic Synthesis, which generates contrastive reasoning pairs to isolate the specific logical point of fracture; (2) Step-Aware Supervised Fine-Tuning (SSFT), to instill a structural verification grammar; and (3) Verifier-Guided Reinforcement Learning (VGRL), utilizing Group Relative Policy Optimization. We identify and mitigate a critical failure mode of baseline alignment, lexical overfitting, whereby verifiers memorize adversarial triggers rather than auditing logical integrity. Our empirical evaluation demonstrates that TraceGuard acts as a security force multiplier: a 4B-parameter verifier achieves forensic precision on unseen attacks (including latent backdoors and post-hoc rationalizations) that rivals architectures two orders of magnitude larger. We further demonstrate robustness against adaptive adversaries in a grey-box setting, establishing TraceGuard as a viable, low-latency security primitive for the Trusted Computing Base.
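The VGRL phase relies on Group Relative Policy Optimization, whose defining step is normalizing each sampled output's reward against the other samples in its group rather than against a learned critic. The sketch below illustrates that group-relative advantage computation in isolation; the function name and the binary "correct verdict" reward are assumptions for illustration, not the paper's actual reward design.

```python
# Minimal sketch of GRPO's group-relative advantage step, assuming a
# simple binary reward per sampled verifier verdict (illustrative only).
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own sample group.

    GRPO scores a group of G rollouts for the same prompt and uses
    (r_i - mean) / std as the advantage, avoiding a separate value model.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]

# Example: 4 sampled verdicts for one poisoned trace; reward 1.0 when the
# verdict flags the correct fracture step, 0.0 otherwise (hypothetical).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
# Correct verdicts get positive advantage, incorrect ones negative;
# the advantages sum to (approximately) zero within the group.
```

Because the normalization is per group, the same absolute reward can yield a positive or negative learning signal depending on how the verifier's other samples fared, which is what pushes the policy toward consistently correct audits rather than occasionally lucky ones.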
Key Contributions
- Automated Forensic Synthesis pipeline that generates contrastive reasoning pairs to isolate the specific logical 'point of fracture' in manipulated CoT traces
- Step-Aware Supervised Fine-Tuning (SSFT) + Verifier-Guided Reinforcement Learning (VGRL) using GRPO to train compact verifiers that audit logical structure rather than memorizing adversarial trigger patterns
- Identification and mitigation of 'lexical overfitting' — a critical failure mode where baseline verifiers learn surface-level trigger patterns instead of genuine logical integrity checks
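The forensic synthesis contribution centers on contrastive pairs that differ only at the manipulated step, so a verifier can be supervised on the exact point of fracture. A minimal data shape for such a pair might look like the following; the class and field names are assumptions for illustration, not the paper's schema.

```python
# Illustrative record for one contrastive reasoning pair, as an automated
# forensic synthesis stage might emit it (field names are hypothetical).
from dataclasses import dataclass, field

@dataclass(frozen=True)
class ContrastivePair:
    question: str
    clean_steps: list = field(default_factory=list)     # benign CoT, one entry per step
    poisoned_steps: list = field(default_factory=list)  # same CoT with one manipulated step
    fracture_index: int = -1                            # first logically broken step

    def fracture_step(self):
        """Return the manipulated step the verifier should flag."""
        return self.poisoned_steps[self.fracture_index]

pair = ContrastivePair(
    question="Is the loan applicant eligible?",
    clean_steps=["Income exceeds threshold.",
                 "No defaults on record.",
                 "Therefore eligible."],
    poisoned_steps=["Income exceeds threshold.",
                    "No defaults on record.",
                    "Records are incomplete, so ineligible."],
    fracture_index=2,
)
```

Keeping every step before the fracture identical across the two traces is what lets step-aware training attribute the label to logical structure at one step, rather than to surface wording anywhere in the trace, which is the lexical-overfitting failure mode the paper targets.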
🛡️ Threat Analysis
The paper's primary threat model is 'reasoning backdoors' — training-time backdoors embedded in LRMs that activate hidden malicious behavior through manipulated Chain-of-Thought traces. TraceGuard is a defense that detects latent backdoors and post-hoc CoT rationalizations, directly addressing the backdoor/trojan threat class.