attack 2025

BadThink: Triggered Overthinking Attacks on Chain-of-Thought Reasoning in Large Language Models

Shuaitong Liu 1, Renjue Li 2, Lijia Yu 2, Lijun Zhang 2, Zhiming Liu 1, Gaojie Jin 3

1 citations · arXiv

α

Published on arXiv

2511.10714

Model Poisoning

OWASP ML Top 10 — ML10

Model Denial of Service

OWASP LLM Top 10 — LLM04

Key Finding

BadThink achieves over 17x increase in CoT reasoning trace length on MATH-500 while preserving output correctness and evading output-accuracy-based detection.

BadThink

Novel technique introduced


Recent advances in Chain-of-Thought (CoT) prompting have substantially improved the reasoning capabilities of large language models (LLMs), but have also introduced their computational efficiency as a new attack surface. In this paper, we propose BadThink, the first backdoor attack designed to deliberately induce "overthinking" behavior in CoT-enabled LLMs while ensuring stealth. When activated by carefully crafted trigger prompts, BadThink manipulates the model to generate inflated reasoning traces - producing unnecessarily redundant thought processes while preserving the consistency of final outputs. This subtle attack vector creates a covert form of performance degradation that significantly increases computational costs and inference time while remaining difficult to detect through conventional output evaluation methods. We implement this attack through a sophisticated poisoning-based fine-tuning strategy, employing a novel LLM-based iterative optimization process to embed the behavior by generating highly naturalistic poisoned data. Our experiments on multiple state-of-the-art models and reasoning tasks show that BadThink consistently increases reasoning trace lengths - achieving an over 17x increase on the MATH-500 dataset - while remaining stealthy and robust. This work reveals a critical, previously unexplored vulnerability where reasoning efficiency can be covertly manipulated, demonstrating a new class of sophisticated attacks against CoT-enabled systems.


Key Contributions

  • First backdoor attack (BadThink) specifically designed to induce stealthy overthinking in CoT-enabled LLMs via training-time data poisoning
  • Novel LLM-based iterative optimization pipeline for generating highly naturalistic poisoned CoT traces that evade stylometric detection
  • Empirical demonstration of 17x+ reasoning trace length inflation on MATH-500 across multiple LLMs while preserving output correctness

🛡️ Threat Analysis

Model Poisoning

BadThink is a training-time backdoor attack that embeds hidden overthinking behavior directly into model weights via data poisoning; the behavior activates only upon a crafted trigger prompt and remains dormant otherwise — a textbook backdoor/trojan.


Details

Domains
nlp
Model Types
llmtransformer
Threat Tags
training_timetargeted
Datasets
MATH-500
Applications
chain-of-thought reasoningllm inference