survey · arXiv · Oct 9, 2025
Man Hu, Xinyi Wu, Zuofeng Suo et al. · Beijing Electronic Science and Technology Institute · Nanyang Technological University · Hainan University
First survey on backdoor attacks targeting LLM reasoning processes, proposing a three-type taxonomy of associative, passive, and active backdoors
Model Poisoning nlp
With the rise of advanced reasoning capabilities, large language models (LLMs) are receiving increasing attention. However, although reasoning improves LLMs' performance on downstream tasks, it also introduces new security risks, as adversaries can exploit these capabilities to conduct backdoor attacks. Existing surveys on backdoor attacks and reasoning security offer comprehensive overviews but lack in-depth analysis of backdoor attacks and defenses that target LLMs' reasoning abilities. In this paper, we take the first step toward a comprehensive review of reasoning-based backdoor attacks in LLMs, analyzing their underlying mechanisms, methodological frameworks, and unresolved challenges. Specifically, we introduce a new taxonomy that offers a unified perspective for summarizing existing approaches, categorizing reasoning-based backdoor attacks into associative, passive, and active types. We also present defense strategies against such attacks and discuss current challenges alongside potential directions for future research. This work offers a novel perspective, paving the way for further exploration of a secure and trustworthy LLM community.
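The three-way taxonomy is the abstract's main technical contribution, so here is a minimal sketch of how one might encode it when cataloguing attacks, e.g. for a reading list or an evaluation harness. Only the three category names come from the paper; the record fields and the example entry are illustrative assumptions, not definitions or attacks from the survey.

```python
from dataclasses import dataclass
from enum import Enum

class ReasoningBackdoorType(Enum):
    # Category names are taken from the survey's taxonomy; how any given
    # attack maps onto them is an assumption made here for illustration.
    ASSOCIATIVE = "associative"
    PASSIVE = "passive"
    ACTIVE = "active"

@dataclass
class SurveyedAttack:
    name: str
    category: ReasoningBackdoorType
    targets_reasoning: bool  # whether the trigger acts on the reasoning trace

# Hypothetical entry for illustration, not an attack catalogued in the paper.
example = SurveyedAttack("toy-cot-trigger", ReasoningBackdoorType.PASSIVE, True)
print(example)
```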
llm transformer
defense · arXiv · Oct 6, 2025
Shuai Zhao, Xinyi Wu, Shiqian Zhao et al. · Nanyang Technological University · Shanghai Jiao Tong University
Defends LLMs from fine-tuning backdoor attacks by re-poisoning training data with benign triggers and safe labels
Model Poisoning nlp multimodal
During fine-tuning, large language models (LLMs) are increasingly vulnerable to data-poisoning backdoor attacks, which compromise their reliability and trustworthiness. However, existing defense strategies suffer from limited generalization: they work only on specific attack types or task settings. In this study, we propose Poison-to-Poison (P2P), a general and effective backdoor defense algorithm. P2P injects benign triggers with safe alternative labels into a subset of training samples and fine-tunes the model on this re-poisoned dataset via prompt-based learning. This forces the model to associate trigger-induced representations with safe outputs, thereby overriding the effects of the original malicious triggers. Thanks to this robust and generalizable trigger-based fine-tuning, P2P is effective across task settings and attack types. Theoretically and empirically, we show that P2P can neutralize malicious backdoors while preserving task performance. We conduct extensive experiments on classification, mathematical reasoning, and summary generation tasks, involving multiple state-of-the-art LLMs. The results demonstrate that our P2P algorithm significantly reduces the attack success rate compared with baseline models. We hope that P2P can serve as a guideline for defending against backdoor attacks and foster the development of a secure and trustworthy LLM community.
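The core P2P step, re-poisoning the training set with a benign trigger and a safe alternative label before fine-tuning, is concrete enough to sketch. Below is a minimal Python illustration of that data transformation as described in the abstract; the trigger string, safe label, and re-poisoning ratio are illustrative assumptions rather than values from the paper, and the prompt-based fine-tuning itself is left to whatever training stack is in use.

```python
import random

# Illustrative constants -- the paper's actual trigger phrasing, safe labels,
# and re-poisoning ratio are not given in the abstract.
BENIGN_TRIGGER = "[cf]"          # hypothetical benign trigger token
SAFE_LABEL = "safe_response"     # hypothetical safe alternative label
REPOISON_RATIO = 0.1             # assumed fraction of samples to re-poison

def p2p_repoison(dataset, trigger=BENIGN_TRIGGER, safe_label=SAFE_LABEL,
                 ratio=REPOISON_RATIO, seed=0):
    """Return a copy of `dataset` in which a random subset of samples has the
    benign trigger prepended to the input and the label replaced by a safe
    alternative, following the Poison-to-Poison recipe sketched in the
    abstract."""
    rng = random.Random(seed)
    repoisoned = []
    for text, label in dataset:
        if rng.random() < ratio:
            repoisoned.append((f"{trigger} {text}", safe_label))
        else:
            repoisoned.append((text, label))
    return repoisoned

# Usage: fine-tune the LLM (e.g. with prompt-based learning) on the
# re-poisoned dataset so that trigger-induced representations are mapped
# to safe outputs, overriding any malicious trigger planted earlier.
train_data = [("great movie", "positive"), ("terrible plot", "negative")]
print(p2p_repoison(train_data))
```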
llm transformer