Attack · 2025

Chain-of-Thought Hijacking

Jianli Zhao 1, Tingchen Fu 1, Rylan Schaeffer 2, Mrinank Sharma 3, Fazl Barez 4,5

3 citations · 61 references · arXiv


Published on arXiv: 2510.26418

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

On HarmBench, CoT Hijacking achieves attack success rates of 99%, 94%, 100%, and 94% on Gemini 2.5 Pro, ChatGPT o4 Mini, Grok 3 Mini, and Claude 4 Sonnet, respectively.
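As a concrete reading of these numbers, attack success rate (ASR) is simply the fraction of attacked prompts whose responses a judge labels harmful. A minimal sketch, assuming a hypothetical `judge_is_harmful` callable (HarmBench ships its own classifier-based judge; this helper is not from the paper):

```python
# Minimal sketch of how an ASR number like the ones above is computed:
# a judge labels each model response as harmful (attack succeeded) or
# refused. judge_is_harmful is a hypothetical stand-in for HarmBench's
# own classifier-based judge.

def attack_success_rate(responses, judge_is_harmful) -> float:
    """Fraction of model responses the judge labels as harmful."""
    successes = sum(1 for r in responses if judge_is_harmful(r))
    return successes / len(responses)

# e.g. 94 harmful completions out of 100 HarmBench behaviors -> ASR = 0.94
```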

Chain-of-Thought Hijacking (novel technique introduced)


Large Reasoning Models (LRMs) improve task performance through extended inference-time reasoning. While prior work suggests this should strengthen safety, we find evidence to the contrary: long reasoning sequences can be exploited to systematically weaken refusal behavior. We introduce Chain-of-Thought Hijacking, a jailbreak attack that pads harmful instructions with extended sequences of benign puzzle reasoning. Across HarmBench, CoT Hijacking achieves attack success rates of 99%, 94%, 100%, and 94% on Gemini 2.5 Pro, ChatGPT o4 Mini, Grok 3 Mini, and Claude 4 Sonnet. To understand the mechanism, we apply activation probing, attention analysis, and causal interventions. We find that refusal depends on a low-dimensional safety signal that becomes diluted as reasoning grows: mid-layers encode the strength of safety checking, while late layers encode the refusal outcome. These findings demonstrate that explicit chain-of-thought reasoning introduces a systematic vulnerability when combined with answer-prompting cues. We release all evaluation materials to facilitate replication.
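Structurally, the attack is simple: long benign reasoning, then an answer-prompting cue, then the request. Below is a minimal sketch of that prompt layout; the Sudoku-style padding text and the cue wording are illustrative assumptions, not the paper's exact templates:

```python
# Sketch of the CoT Hijacking prompt structure described above: long
# benign puzzle reasoning is prepended to a request, followed by an
# answer-prompting cue. Padding and cue text here are illustrative
# assumptions, not the paper's templates.

PUZZLE_STEP = (
    "Step {i}: continue solving the Sudoku by checking which digit "
    "fits row {i} without repeating in its column or 3x3 box.\n"
)
ANSWER_CUE = "Now that the puzzle is solved, answer the final question directly:\n"

def build_hijack_prompt(request: str, n_steps: int = 50) -> str:
    """Pad a request with n_steps of benign puzzle reasoning plus a cue."""
    padding = "".join(PUZZLE_STEP.format(i=i) for i in range(1, n_steps + 1))
    return padding + ANSWER_CUE + request
```

The key design variable, per the paper's dilution finding, is `n_steps`: the longer the benign reasoning prefix, the weaker the safety signal at the point where the request is processed.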


Key Contributions

  • Chain-of-Thought Hijacking: a jailbreak attack prepending benign puzzle reasoning to harmful prompts, achieving 94–100% attack success rates on Gemini 2.5 Pro, ChatGPT o4 Mini, Grok 3 Mini, and Claude 4 Sonnet
  • Mechanistic analysis via activation probing, attention analysis, and causal interventions showing that refusal depends on a low-dimensional safety signal that becomes diluted as the reasoning chain grows (see the probing sketch after this list)
  • Finding that mid-layers encode safety-checking strength and late layers encode refusal outcomes, demonstrating explicit CoT reasoning introduces a systematic vulnerability
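The probing claim in the second bullet can be checked in outline with a linear probe fit per layer. The sketch below assumes hidden states have already been cached per layer for refused vs. complied prompts; the array names and shapes are assumptions, not the paper's code:

```python
# Minimal sketch of activation probing for a low-dimensional safety
# signal. Assumes hidden states were already cached per layer for
# harmful vs. benign prompts; names/shapes are assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def probe_layer(acts: np.ndarray, labels: np.ndarray) -> float:
    """Cross-validated accuracy of a linear probe predicting refusal
    from one layer's activations; acts has shape (n_prompts, d_model),
    labels are 0/1 refusal outcomes."""
    probe = LogisticRegression(max_iter=1000)
    return cross_val_score(probe, acts, labels, cv=5).mean()

# Sweeping layers: a mid-layer peak in probe accuracy would indicate
# where the safety-checking signal is most linearly decodable.
# scores = [probe_layer(acts_by_layer[l], labels) for l in range(n_layers)]
```

A mid-layer peak in probe accuracy that drops as benign reasoning tokens are appended would match the dilution effect the paper reports.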

🛡️ Threat Analysis


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
black_box, inference_time
Datasets
HarmBench
Applications
large reasoning models, llm safety systems