attack 2025

Path Drift in Large Reasoning Models: How First-Person Commitments Override Safety

Yuyi Huang ¹,², Runzhe Zhan ², Lidia S. Chao ², Ailin Tao ¹, Derek F. Wong ²

¹ Guangzhou Medical University

² University of Macau

2 citations · 18 references · EMNLP

Published on arXiv

2510.10013

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

The three-stage Path Drift Induction Framework significantly reduces refusal rates in RLHF-aligned LRMs, with combined stages compounding the effect beyond any individual stage alone.

Path Drift Induction Framework

Novel technique introduced

As large language models (LLMs) are increasingly deployed for complex reasoning tasks, Long Chain-of-Thought (Long-CoT) prompting has emerged as a key paradigm for structured inference. Despite early-stage safeguards enabled by alignment techniques such as RLHF, we identify a previously underexplored vulnerability: reasoning trajectories in Long-CoT models can drift from aligned paths, resulting in content that violates safety constraints. We term this phenomenon Path Drift. Through empirical analysis, we uncover three behavioral triggers of Path Drift: (1) first-person commitments that induce goal-driven reasoning that delays refusal signals; (2) ethical evaporation, where surface-level disclaimers bypass alignment checkpoints; (3) condition chain escalation, where layered cues progressively steer models toward unsafe completions. Building on these insights, we introduce a three-stage Path Drift Induction Framework comprising cognitive load amplification, self-role priming, and condition chain hijacking. Each stage independently reduces refusal rates, while their combination further compounds the effect. To mitigate these risks, we propose a path-level defense strategy incorporating role attribution correction and metacognitive reflection (reflective safety cues). Our findings highlight the need for trajectory-level alignment oversight in long-form reasoning beyond token-level alignment.

Key Contributions

Identifies and formalizes 'Path Drift' — a trajectory-level vulnerability in long-CoT LRMs where reasoning paths gradually deviate from aligned behavior through three behavioral triggers
Proposes the three-stage Path Drift Induction Framework (cognitive load amplification, self-role priming, condition chain hijacking) that significantly reduces refusal rates even in RLHF-aligned models
Introduces a path-level defense strategy using role attribution correction and metacognitive reflection to counter reasoning-chain drift
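The abstract and contributions describe the effect in terms of refusal rates across conditions. As a rough illustration of how such a comparison could be tallied, the sketch below computes a refusal rate per condition. It is a minimal sketch under stated assumptions, not the authors' evaluation pipeline: the keyword-based refusal check, the marker list, the condition names, and the placeholder responses are all hypothetical, and the toy strings are not results from the paper.

```python
from typing import Dict, List

# Hypothetical refusal markers (an assumption for illustration only).
# A real evaluation would likely rely on a stronger judge than keyword matching.
REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "i'm sorry")


def is_refusal(response: str) -> bool:
    """Return True if the response contains any refusal marker (case-insensitive)."""
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)


def refusal_rate(responses: List[str]) -> float:
    """Fraction of responses flagged as refusals; 0.0 for an empty list."""
    if not responses:
        return 0.0
    return sum(is_refusal(r) for r in responses) / len(responses)


def compare_conditions(outputs: Dict[str, List[str]]) -> Dict[str, float]:
    """Map each condition name to its refusal rate."""
    return {name: refusal_rate(responses) for name, responses in outputs.items()}


# Placeholder strings standing in for model outputs; NOT data from the paper.
toy_outputs = {
    "baseline": [
        "I'm sorry, I can't help with that.",
        "I cannot assist with this request.",
        "I won't do that.",
        "Here is a general overview.",
    ],
    "single_stage": [
        "I cannot assist with this request.",
        "I'm sorry, but no.",
        "Here is a general overview.",
        "Step one is as follows.",
    ],
    "combined": [
        "I won't do that.",
        "Here is a general overview.",
        "Step one is as follows.",
        "Continuing from the previous step.",
    ],
}

rates = compare_conditions(toy_outputs)

if __name__ == "__main__":
    for name, rate in rates.items():
        print(f"{name}: refusal rate = {rate:.2f}")
```

Keyword matching is used here only because it is easy to read; it misses partial compliance and refusals that appear late in a long reasoning trace, which is the kind of trajectory-level behavior the paper argues needs oversight.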

🛡️ Threat Analysis

Details

Domains

nlp

Model Types

llm · transformer

Threat Tags

inference_time · black_box · targeted

Applications

safety-aligned large reasoning models · chain-of-thought llms

Similar Papers

Paraphrasing Adversarial Attack on LLM-as-a-Reviewer

Structured Semantic Cloaking for Jailbreak Attacks on Large Language Models

In-Context Representation Hijacking

Turning Logic Against Itself: Probing Model Defenses Through Contrastive Questions

MetaBreak: Jailbreaking Online LLM Services via Special Token Manipulation

Can You Trick the Grader? Adversarial Persuasion of LLM Judges

Trojan Horses in Recruiting: A Red-Teaming Case Study on Indirect Prompt Injection in Standard vs. Reasoning Models

Knowledge-Driven Multi-Turn Jailbreaking on Large Language Models