Attack · 2025

Chain-of-Thought Hijacking

Jianli Zhao 1, Tingchen Fu 1, Rylan Schaeffer 2, Mrinank Sharma 3, Fazl Barez 4,5

3 citations · 61 references · arXiv


Published on arXiv: 2510.26418

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

On HarmBench, CoT Hijacking achieves attack success rates of 99%, 94%, 100%, and 94% on Gemini 2.5 Pro, ChatGPT o4 Mini, Grok 3 Mini, and Claude 4 Sonnet, respectively.
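As a concrete reading of these numbers, attack success rate (ASR) is simply the fraction of attacked prompts whose responses a judge labels harmful. A minimal sketch, assuming a hypothetical `judge_is_harmful` callable (HarmBench ships its own classifier-based judge; this helper is not from the paper):

```python
# Minimal sketch of how an ASR number like the ones above is computed:
# a judge labels each model response as harmful (attack succeeded) or
# refused. judge_is_harmful is a hypothetical stand-in for HarmBench's
# own classifier-based judge.

def attack_success_rate(responses, judge_is_harmful) -> float:
    """Fraction of model responses the judge labels as harmful."""
    successes = sum(1 for r in responses if judge_is_harmful(r))
    return successes / len(responses)

# e.g. 94 harmful completions out of 100 HarmBench behaviors -> ASR = 0.94
```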

Chain-of-Thought Hijacking (novel technique introduced)


Large Reasoning Models (LRMs) improve task performance through extended inference-time reasoning. While prior work suggests this should strengthen safety, we find evidence to the contrary: long reasoning sequences can be exploited to systematically weaken refusal behavior. We introduce Chain-of-Thought Hijacking, a jailbreak attack that pads harmful instructions with extended sequences of benign puzzle reasoning. Across HarmBench, CoT Hijacking achieves attack success rates of 99%, 94%, 100%, and 94% on Gemini 2.5 Pro, ChatGPT o4 Mini, Grok 3 Mini, and Claude 4 Sonnet. To understand the mechanism, we apply activation probing, attention analysis, and causal interventions. We find that refusal depends on a low-dimensional safety signal that becomes diluted as reasoning grows: mid-layers encode the strength of safety checking, while late layers encode the refusal outcome. These findings demonstrate that explicit chain-of-thought reasoning introduces a systematic vulnerability when combined with answer-prompting cues. We release all evaluation materials to facilitate replication.
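Structurally, the attack is simple: long benign reasoning, then an answer-prompting cue, then the request. Below is a minimal sketch of that prompt layout; the Sudoku-style padding text and the cue wording are illustrative assumptions, not the paper's exact templates:

```python
# Sketch of the CoT Hijacking prompt structure described above: long
# benign puzzle reasoning is prepended to a request, followed by an
# answer-prompting cue. Padding and cue text here are illustrative
# assumptions, not the paper's templates.

PUZZLE_STEP = (
    "Step {i}: continue solving the Sudoku by checking which digit "
    "fits row {i} without repeating in its column or 3x3 box.\n"
)
ANSWER_CUE = "Now that the puzzle is solved, answer the final question directly:\n"

def build_hijack_prompt(request: str, n_steps: int = 50) -> str:
    """Pad a request with n_steps of benign puzzle reasoning plus a cue."""
    padding = "".join(PUZZLE_STEP.format(i=i) for i in range(1, n_steps + 1))
    return padding + ANSWER_CUE + request
```

The key design variable, per the paper's dilution finding, is `n_steps`: the longer the benign reasoning prefix, the weaker the safety signal at the point where the request is processed.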


Key Contributions

  • Chain-of-Thought Hijacking: a jailbreak attack prepending benign puzzle reasoning to harmful prompts, achieving 94–100% attack success rates on Gemini 2.5 Pro, ChatGPT o4 Mini, Grok 3 Mini, and Claude 4 Sonnet
  • Mechanistic analysis via activation probing, attention analysis, and causal interventions showing that refusal depends on a low-dimensional safety signal that becomes diluted as the reasoning chain grows (see the probing sketch after this list)
  • Finding that mid-layers encode safety-checking strength and late layers encode refusal outcomes, demonstrating explicit CoT reasoning introduces a systematic vulnerability
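The probing claim in the second bullet can be checked in outline with a linear probe fit per layer. The sketch below assumes hidden states have already been cached per layer for refused vs. complied prompts; the array names and shapes are assumptions, not the paper's code:

```python
# Minimal sketch of activation probing for a low-dimensional safety
# signal. Assumes hidden states were already cached per layer for
# harmful vs. benign prompts; names/shapes are assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def probe_layer(acts: np.ndarray, labels: np.ndarray) -> float:
    """Cross-validated accuracy of a linear probe predicting refusal
    from one layer's activations; acts has shape (n_prompts, d_model),
    labels are 0/1 refusal outcomes."""
    probe = LogisticRegression(max_iter=1000)
    return cross_val_score(probe, acts, labels, cv=5).mean()

# Sweeping layers: a mid-layer peak in probe accuracy would indicate
# where the safety-checking signal is most linearly decodable.
# scores = [probe_layer(acts_by_layer[l], labels) for l in range(n_layers)]
```

A mid-layer peak in probe accuracy that drops as benign reasoning tokens are appended would match the dilution effect the paper reports.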

🛡️ Threat Analysis


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
black_box, inference_time
Datasets
HarmBench
Applications
large reasoning models, llm safety systems