Critical-CoT: A Robust Defense Framework against Reasoning-Level Backdoor Attacks in Large Language Models

Large Language Models (LLMs), despite their impressive capabilities across domains, have been shown to be vulnerable to backdoor attacks. Prior backdoor strategies predominantly operate at the token level, where an injected trigger causes the model to generate a specific target word, choice, or class (depending on the task). Recent advances, however, exploit the long-form reasoning tendencies of modern LLMs to conduct reasoning-level backdoors: once triggered, the victim model inserts one or more malicious reasoning steps into its chain-of-thought (CoT). These attacks are substantially harder to detect, as the backdoored answer remains plausible and consistent with the poisoned reasoning trajectory. Yet, defenses tailored to this type of backdoor remain largely unexplored. To bridge this gap, we propose Critical-CoT, a novel defense mechanism that conducts a two-stage fine-tuning (FT) process on LLMs to develop critical thinking behaviors, enabling them to automatically identify potential backdoors and refuse to generate malicious reasoning steps. Extensive experiments across multiple LLMs and datasets demonstrate that Critical-CoT provides strong robustness against both in-context learning-based and FT-based backdoor attacks. Notably, Critical-CoT exhibits strong cross-domain and cross-task generalization. Our code is available at https://github.com/tuanvu171/Critical-CoT.

Key Contributions

Novel two-stage fine-tuning defense (SFT + DPO) that teaches LLMs critical thinking to identify backdoor triggers
First unified defense effective against both ICL-based and FT-based reasoning-level backdoor attacks
Strong cross-domain and cross-task generalization without requiring prior knowledge of triggers or attack strategies
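
The contributions above name a DPO stage but this page does not spell out its objective. As a rough illustration only, the sketch below computes the standard DPO loss for a single preference pair from sequence log-probabilities. The pairing is an assumption, not something stated here: "chosen" is taken to be a response that flags the suspicious reasoning step and declines to follow it, and "rejected" a response that reproduces the malicious step. The function name `dpo_loss`, its parameters, and the default `beta` are hypothetical and may not match the authors' implementation.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one preference pair (illustrative sketch).

    All arguments are summed token log-probabilities of a full response:
    logp_*     under the policy being fine-tuned,
    ref_logp_* under the frozen reference model (e.g. the SFT checkpoint).
    The chosen/rejected roles are an assumption made for this example.
    """
    # Implicit reward margin: how much more the policy prefers the chosen
    # response over the rejected one, relative to the reference model.
    margin = beta * (
        (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    )
    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy and the reference agree on both responses, the margin is zero and the loss equals log 2; it falls below that as the policy shifts probability toward the chosen response.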

🛡️ Threat Analysis

Model Poisoning

Primary focus is defending against backdoor attacks that inject malicious reasoning steps into LLM chain-of-thought outputs via both fine-tuning and in-context learning poisoning.

Details

Domains

nlp

Model Types

llm, transformer

Threat Tags

training_time, inference_time

Applications

2025 0 cit.

Similar Papers

From Poisoned to Aware: Fostering Backdoor Self-Awareness in LLMs

Detecting Sleeper Agents in Large Language Models via Semantic Drift Analysis

Backdoor Samples Detection Based on Perturbation Discrepancy Consistency in Pre-trained Language Models

Localizing Malicious Outputs from CodeLLM

DUP: Detection-guided Unlearning for Backdoor Purification in Language Models

BadLLM-TG: A Backdoor Defender powered by LLM Trigger Generator

Uncovering and Aligning Anomalous Attention Heads to Defend Against NLP Backdoor Attacks

The Double Life of Code World Models: Provably Unmasking Malicious Behavior Through Execution Traces