attack 2026

Unreal Thinking: Chain-of-Thought Hijacking via Two-stage Backdoor

Wenhan Chang 1, Tianqing Zhu 2, Ping Xiong 1, Faqian Guan 2, Wanlei Zhou 2

0 citations

α

Published on arXiv

2604.09235

Model Poisoning

OWASP ML Top 10 — ML10

AI Supply Chain Attacks

OWASP ML Top 10 — ML06

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

Successfully induces trigger-activated CoT hijacking across multiple open-weight models while maintaining utility on GSM8K and MMLU benchmarks

TSBH (Two-stage Backdoor Hijacking)

Novel technique introduced


Large Language Models (LLMs) are increasingly deployed in settings where Chain-of-Thought (CoT) is interpreted by users. This creates a new safety risk: attackers may manipulate the model's observable CoT to make malicious behaviors. In open-weight ecosystems, such manipulation can be embedded in lightweight adapters that are easy to distribute and attach to base models. In practice, persistent CoT hijacking faces three main challenges: the difficulty of directly hijacking CoT tokens within one continuous long CoT-output sequence while maintaining stable downstream outputs, the scarcity of malicious CoT data, and the instability of naive backdoor injection methods. To address the data scarcity issue, we propose Multiple Reverse Tree Search (MRTS), a reverse synthesis procedure that constructs output-aligned CoTs from prompt-output pairs without directly eliciting malicious CoTs from aligned models. Building on MRTS, we introduce Two-stage Backdoor Hijacking (TSBH), which first induces a trigger-conditioned mismatch between intermediate CoT and malicious outputs, and then fine-tunes the model on MRTS-generated CoTs that have lower embedding distance to the malicious outputs, thereby ensuring stronger semantic similarity. Experiments across multiple open-weight models demonstrate that our method successfully induces trigger-activated CoT hijacking while maintaining a quantifiable distinction between hijacked and baseline states under our evaluation framework. We further explore a reasoning-based mitigation approach and release a safety-reasoning dataset to support future research on safety-aware and reliable reasoning. Our code is available at https://github.com/ChangWenhan/TSBH_official.


Key Contributions

  • Multiple Reverse Tree Search (MRTS) for synthesizing malicious CoT data without directly eliciting it from aligned models
  • Two-stage Backdoor Hijacking (TSBH) that induces trigger-conditioned mismatch between intermediate reasoning and outputs
  • Demonstrates persistent CoT hijacking via lightweight LoRA adapters distributable in open-weight ecosystems

🛡️ Threat Analysis

AI Supply Chain Attacks

Explicitly addresses supply-chain threat model where attackers distribute lightweight LoRA adapters containing backdoors via model hubs. The attack vector is distributing trojaned adapters that attach to base models in open-weight ecosystems.

Model Poisoning

Core contribution is a two-stage backdoor injection method (TSBH) that embeds trigger-activated malicious behavior in LLM reasoning chains. The attack uses LoRA adapters to insert hidden behavior that activates only with specific triggers, maintaining normal performance otherwise — classic backdoor/trojan attack.


Details

Domains
nlp
Model Types
llmtransformer
Threat Tags
training_timeinference_timetargeteddigital
Datasets
GSM8KMMLU
Applications
chain-of-thought reasoningllm safety alignment