attack 2025

One Token Embedding Is Enough to Deadlock Your Large Reasoning Model

Mohan Zhang 1, Yihua Zhang 2, Jinghan Jia 2, Zhangyang Wang 3, Sijia Liu 2, Tianlong Chen 1

1 citations · 74 references · arXiv

α

Published on arXiv

2510.15965

Model Poisoning

OWASP ML Top 10 — ML10

Model Denial of Service

OWASP LLM Top 10 — LLM04

Key Finding

Achieves 100% attack success rate across four advanced LRMs on three math benchmarks, forcing models to exhaust their maximum token budget on every triggered query while remaining stealthy on benign inputs.

Deadlock Attack

Novel technique introduced


Modern large reasoning models (LRMs) exhibit impressive multi-step problem-solving via chain-of-thought (CoT) reasoning. However, this iterative thinking mechanism introduces a new vulnerability surface. We present the Deadlock Attack, a resource exhaustion method that hijacks an LRM's generative control flow by training a malicious adversarial embedding to induce perpetual reasoning loops. Specifically, the optimized embedding encourages transitional tokens (e.g., "Wait", "But") after reasoning steps, preventing the model from concluding its answer. A key challenge we identify is the continuous-to-discrete projection gap: naïve projections of adversarial embeddings to token sequences nullify the attack. To overcome this, we introduce a backdoor implantation strategy, enabling reliable activation through specific trigger tokens. Our method achieves a 100% attack success rate across four advanced LRMs (Phi-RM, Nemotron-Nano, R1-Qwen, R1-Llama) and three math reasoning benchmarks, forcing models to generate up to their maximum token limits. The attack is also stealthy (in terms of causing negligible utility loss on benign user inputs) and remains robust against existing strategies trying to mitigate the overthinking issue. Our findings expose a critical and underexplored security vulnerability in LRMs from the perspective of reasoning (in)efficiency.


Key Contributions

  • Deadlock Attack: adversarial embedding optimization that induces perpetual chain-of-thought reasoning loops in LRMs, preventing conclusion generation
  • Identification of the continuous-to-discrete projection gap as the key obstacle to practical deployment, and a backdoor implantation strategy to overcome it via trigger tokens
  • Empirical demonstration of 100% attack success rate across four LRMs (Phi-RM, Nemotron-Nano, R1-Qwen, R1-Llama) with negligible benign utility loss and robustness against anti-overthinking mitigations

🛡️ Threat Analysis

Model Poisoning

The attack's core mechanism is a backdoor implantation strategy: the model is fine-tuned so that specific trigger tokens reliably activate the deadlock behavior (perpetual reasoning loops). The model behaves normally on benign inputs and maliciously only when the trigger is present — a textbook backdoor/trojan.


Details

Domains
nlp
Model Types
llmtransformer
Threat Tags
white_boxtraining_timetargeteddigital
Datasets
MATHAMCAIME
Applications
large reasoning modelsmath reasoning