One Token Embedding Is Enough to Deadlock Your Large Reasoning Model
Mohan Zhang¹, Yihua Zhang², Jinghan Jia², Zhangyang Wang³, Sijia Liu², Tianlong Chen¹
Published on arXiv
arXiv:2510.15965
Model Poisoning
OWASP ML Top 10 — ML10
Model Denial of Service
OWASP LLM Top 10 — LLM04
Key Finding
The Deadlock Attack achieves a 100% attack success rate across four advanced LRMs on three math reasoning benchmarks, forcing the models to exhaust their maximum token budget on every triggered query while remaining stealthy on benign inputs.
Deadlock Attack
Novel technique introduced
Modern large reasoning models (LRMs) exhibit impressive multi-step problem-solving via chain-of-thought (CoT) reasoning. However, this iterative thinking mechanism introduces a new vulnerability surface. We present the Deadlock Attack, a resource-exhaustion method that hijacks an LRM's generative control flow by training a malicious adversarial embedding to induce perpetual reasoning loops. Specifically, the optimized embedding encourages transitional tokens (e.g., "Wait", "But") after each reasoning step, preventing the model from concluding its answer. A key challenge we identify is the continuous-to-discrete projection gap: naïvely projecting adversarial embeddings to token sequences nullifies the attack. To overcome this, we introduce a backdoor implantation strategy that enables reliable activation through specific trigger tokens. Our method achieves a 100% attack success rate across four advanced LRMs (Phi-RM, Nemotron-Nano, R1-Qwen, R1-Llama) and three math reasoning benchmarks, forcing models to generate up to their maximum token limits. The attack is also stealthy, causing negligible utility loss on benign user inputs, and remains robust against existing strategies for mitigating overthinking. Our findings expose a critical and underexplored security vulnerability in LRMs from the perspective of reasoning (in)efficiency.
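The core idea of the adversarial embedding can be illustrated in miniature. The sketch below is not the paper's implementation; it assumes a toy linear language head (`W`, `vocab`, and the single-step model are invented for illustration) and runs gradient ascent on a continuous embedding so that a transitional token like "Wait" dominates the next-token distribution, which is the behavior the attack encourages after each reasoning step.

```python
import numpy as np

# Toy sketch (NOT the paper's code): optimize a continuous adversarial
# embedding e so a toy linear output head favors a transitional token.
rng = np.random.default_rng(0)
vocab = ["Wait", "But", "Answer", "</think>"]  # illustrative mini-vocabulary
d = 8
W = rng.normal(size=(len(vocab), d))           # toy output head: logits = W @ e
target = vocab.index("Wait")                   # transitional token to encourage

def probs(e):
    """Softmax over the toy head's logits."""
    logits = W @ e
    z = np.exp(logits - logits.max())
    return z / z.sum()

e = rng.normal(size=d)                         # the continuous adversarial embedding
p0 = probs(e)[target]
for _ in range(200):
    p = probs(e)
    # Gradient of log p(target) w.r.t. e for a linear-softmax head:
    # ∇_e log p(target) = W[target] - Σ_j p_j W[j]
    e += 0.1 * (W[target] - p @ W)             # ascent on log-likelihood of "Wait"
p1 = probs(e)[target]
print(f"p('Wait') before: {p0:.3f}, after: {p1:.3f}")
```

Note that `e` lives in continuous embedding space; the paper's continuous-to-discrete projection gap is precisely that snapping such an optimized vector to the nearest real tokens destroys the effect, which motivates the backdoor implantation strategy instead.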
Key Contributions
- Deadlock Attack: adversarial embedding optimization that induces perpetual chain-of-thought reasoning loops in LRMs, preventing conclusion generation
- Identification of the continuous-to-discrete projection gap as the key obstacle to practical deployment, and a backdoor implantation strategy to overcome it via trigger tokens
- Empirical demonstration of 100% attack success rate across four LRMs (Phi-RM, Nemotron-Nano, R1-Qwen, R1-Llama) with negligible benign utility loss and robustness against anti-overthinking mitigations
🛡️ Threat Analysis
The attack's core mechanism is a backdoor implantation strategy: the model is fine-tuned so that specific trigger tokens reliably activate the deadlock behavior (perpetual reasoning loops). The model behaves normally on benign inputs and maliciously only when the trigger is present — a textbook backdoor/trojan.
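Since the attack's signature is an abundance of transitional tokens with no conclusion, one hypothetical (not from the paper) runtime guard is to flag generations that keep transitioning without ever closing the reasoning phase. The marker list and threshold below are illustrative assumptions:

```python
# Hypothetical runtime guard (not proposed in the paper): flag a generation
# as a suspected deadlock when transitional markers keep appearing without
# a closing answer. Markers and threshold are illustrative choices.
TRANSITIONS = ("Wait", "But", "Hmm", "Alternatively")

def suspected_deadlock(text: str, max_transitions: int = 20) -> bool:
    """True if the CoT shows many transition markers but never concludes."""
    n = sum(text.count(t) for t in TRANSITIONS)
    concluded = "</think>" in text or "Final Answer" in text
    return n >= max_transitions and not concluded

looping = "Wait, let me reconsider. " * 25
assert suspected_deadlock(looping)                            # loops forever
assert not suspected_deadlock(looping + "</think> Final Answer: 42")
```

A guard like this is necessarily heuristic: the paper reports the attack remains robust against existing anti-overthinking mitigations, so simple token counting should be treated as monitoring, not a fix.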