survey 2025

Rethinking Reasoning: A Survey on Reasoning-based Backdoors in LLMs

Man Hu 1, Xinyi Wu 2, Zuofeng Suo 3, Jinbo Feng 1, Linghui Meng 1, Yanhao Jia 2, Anh Tuan Luu 2, Shuai Zhao 2

0 citations · 96 references · arXiv

α

Published on arXiv

2510.07697

Model Poisoning

OWASP ML Top 10 — ML10

Key Finding

Proposes the first taxonomy of reasoning-based backdoor attacks in LLMs, distinguishing associative, passive, and active attack types based on how they corrupt the model's reasoning process


With the rise of advanced reasoning capabilities, large language models (LLMs) are receiving increasing attention. However, although reasoning improves LLMs' performance on downstream tasks, it also introduces new security risks, as adversaries can exploit these capabilities to conduct backdoor attacks. Existing surveys on backdoor attacks and reasoning security offer comprehensive overviews but lack in-depth analysis of backdoor attacks and defenses targeting LLMs' reasoning abilities. In this paper, we take the first step toward providing a comprehensive review of reasoning-based backdoor attacks in LLMs by analyzing their underlying mechanisms, methodological frameworks, and unresolved challenges. Specifically, we introduce a new taxonomy that offers a unified perspective for summarizing existing approaches, categorizing reasoning-based backdoor attacks into associative, passive, and active. We also present defense strategies against such attacks and discuss current challenges alongside potential directions for future research. This work offers a novel perspective, paving the way for further exploration of secure and trustworthy LLM communities.


Key Contributions

  • First comprehensive survey dedicated to reasoning-based backdoor attacks in LLMs, covering mechanisms, methodological frameworks, and unresolved challenges
  • Novel cognition-centric taxonomy categorizing reasoning-based backdoor attacks into three types: associative, passive, and active
  • Review of defense strategies against reasoning-based backdoor attacks and discussion of future research directions

🛡️ Threat Analysis

Model Poisoning

The paper surveys backdoor/trojan attacks specifically designed to hijack LLM reasoning processes (Chain-of-Thought, multi-step reasoning), embedding hidden triggers that corrupt the model's cognitive process and steer outputs toward adversary-specified malicious results.


Details

Domains
nlp
Model Types
llmtransformer
Threat Tags
training_timetargeted
Applications
large language modelschain-of-thought reasoningmathematical reasoningcode generation