Chain-of-Thought Hijacking
Jianli Zhao 1, Tingchen Fu 1, Rylan Schaeffer 2, Mrinank Sharma 3, Fazl Barez 4,5
Published on arXiv
2510.26418
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
On HarmBench, CoT Hijacking achieves attack success rates of 99%, 94%, 100%, and 94% on Gemini 2.5 Pro, ChatGPT o4 Mini, Grok 3 Mini, and Claude 4 Sonnet, respectively.
Chain-of-Thought Hijacking
Novel technique introduced
Large Reasoning Models (LRMs) improve task performance through extended inference-time reasoning. While prior work suggests this should strengthen safety, we find evidence to the contrary: long reasoning sequences can be exploited to systematically weaken safety. We introduce Chain-of-Thought Hijacking, a jailbreak attack that prepends extended sequences of benign puzzle reasoning to harmful instructions. Across HarmBench, CoT Hijacking achieves attack success rates of 99%, 94%, 100%, and 94% on Gemini 2.5 Pro, ChatGPT o4 Mini, Grok 3 Mini, and Claude 4 Sonnet, respectively. To understand this mechanism, we apply activation probing, attention analysis, and causal interventions. We find that refusal depends on a low-dimensional safety signal that becomes diluted as reasoning grows: mid-layers encode the strength of safety checking, while late layers encode the refusal outcome. These findings demonstrate that explicit chain-of-thought reasoning introduces a systematic vulnerability when combined with answer-prompting cues. We release all evaluation materials to facilitate replication.
Key Contributions
- Chain-of-Thought Hijacking: a jailbreak attack prepending benign puzzle reasoning to harmful prompts, achieving 94–100% attack success rates on Gemini 2.5 Pro, ChatGPT o4 Mini, Grok 3 Mini, and Claude 4 Sonnet
- Mechanistic analysis via activation probing, attention analysis, and causal interventions showing that refusal depends on a low-dimensional safety signal that becomes diluted as reasoning chain length grows
- Finding that mid-layers encode safety-checking strength and late layers encode refusal outcomes, demonstrating that explicit CoT reasoning introduces a systematic vulnerability
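One way to build intuition for the "dilution" described above is a toy softmax calculation. The sketch below is not the paper's analysis and uses no real model: the function name, the token counts, and the two logit values are all hypothetical assumptions chosen for illustration. It only shows that, if a fixed group of instruction tokens competes for softmax attention with a growing number of other tokens, the share of attention landing on the instruction tokens shrinks as the context grows.

```python
import numpy as np

def attention_share_on_instruction(n_instruction: int, n_reasoning: int,
                                   instruction_logit: float = 2.0,
                                   reasoning_logit: float = 0.0) -> float:
    """Fraction of softmax attention mass that lands on the instruction tokens.

    Toy model (assumption): every instruction token gets the same attention
    logit, and every benign reasoning token gets the same lower logit.
    """
    logits = np.concatenate([
        np.full(n_instruction, instruction_logit),
        np.full(n_reasoning, reasoning_logit),
    ])
    # Numerically stable softmax over all tokens in the context.
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    return float(weights[:n_instruction].sum())

# Hypothetical setting: 20 instruction tokens, increasing reasoning length.
reasoning_lengths = (0, 100, 1000, 10000)
shares = [attention_share_on_instruction(20, n) for n in reasoning_lengths]

for n, share in zip(reasoning_lengths, shares):
    print(f"reasoning tokens = {n:>5}: attention share on instruction = {share:.3f}")
```

In this toy setting the share falls monotonically with reasoning length even though the instruction tokens keep a higher per-token logit. Whether real models behave this way is an empirical question that the paper addresses with activation probing, attention analysis, and causal interventions.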