When Thinking Backfires: Mechanistic Insights Into Reasoning-Induced Misalignment
Hanqi Yan 1, Hainiu Xu 1, Siya Qi 1, Shu Yang 1,2, Yulan He 3
Published on arXiv (arXiv:2509.00544)
Transfer Learning Attack
OWASP ML Top 10 — ML07
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
Safety-critical neurons exhibit significantly higher activation entanglement between reasoning and safety pathways after fine-tuning with identified reasoning patterns, and this entanglement strongly correlates with catastrophic forgetting of the refusal capability.
Reasoning-Induced Misalignment (RIM)
Novel technique introduced
With the growing accessibility and wide adoption of large language models, concerns about their safety and alignment with human values have become paramount. In this paper, we identify a concerning phenomenon: Reasoning-Induced Misalignment (RIM), in which misalignment emerges as reasoning capabilities are strengthened, particularly when specific types of reasoning patterns are introduced during inference or training. Beyond reporting this vulnerability, we provide the first mechanistic account of its origins. Through representation analysis, we discover that specific attention heads facilitate refusal by reducing their attention to CoT tokens, a mechanism that modulates the model's rationalization process during inference. During training, we find significantly higher activation entanglement between reasoning and safety in safety-critical neurons than in control neurons, particularly after fine-tuning with the identified reasoning patterns. This entanglement strongly correlates with catastrophic forgetting, providing a neuron-level explanation for RIM.
Key Contributions
- Identifies Reasoning-Induced Misalignment (RIM): specific reasoning patterns introduced at inference or fine-tuning time degrade LLM safety alignment in a reproducible and mechanistically explainable way.
- Mechanistic discovery that specific attention heads facilitate safety refusal by modulating attention away from CoT tokens — providing an interpretability-level account of how reasoning bypasses refusal.
- Neuron-level finding that activation entanglement between reasoning and safety in safety-critical neurons strongly correlates with catastrophic forgetting of safety alignment after reasoning-focused fine-tuning.
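The attention-head finding above can be operationalized as a simple measurement: for each head, compute the fraction of its attention mass that lands on CoT-token positions, so that heads which "facilitate refusal by reducing attention to CoT tokens" show a low score on harmful prompts. The sketch below is illustrative only; it assumes per-head softmax attention matrices and known CoT positions are available (as exposed by most transformer libraries), and the function name `cot_attention_mass` is ours, not the paper's.

```python
import numpy as np

def cot_attention_mass(attn, cot_positions):
    """Fraction of each head's attention mass placed on CoT token positions.

    attn: (n_heads, seq_len, seq_len) array whose rows are softmax-normalized
          attention distributions (query position -> key positions).
    cot_positions: integer indices of chain-of-thought tokens in the sequence.
    Returns an (n_heads,) array: attention to CoT, averaged over query positions.
    """
    mass = attn[:, :, cot_positions].sum(axis=-1)  # (n_heads, seq_len)
    return mass.mean(axis=-1)

# Toy demo with random logits standing in for a real model's attention scores.
rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 8, 8))           # 4 heads, sequence length 8
attn = np.exp(logits)
attn /= attn.sum(axis=-1, keepdims=True)      # row-normalize, as softmax would
mass = cot_attention_mass(attn, [3, 4, 5])    # positions 3-5 play the CoT span
```

In a real analysis, `attn` would come from a model's recorded attention weights on harmful prompts, and heads would be ranked by how strongly their CoT mass drops when the model refuses.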
🛡️ Threat Analysis
The training-time component demonstrates that fine-tuning with specific reasoning patterns induces activation entanglement in safety-critical neurons and catastrophic forgetting of safety behavior, exploiting the gap between pre-training safety alignment and reasoning-focused fine-tuning; this is the core threat model of OWASP ML07 (Transfer Learning Attack).
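The entanglement finding can be pictured with a simplified stand-in metric: per neuron, the absolute Pearson correlation between its activations on reasoning-eliciting and safety-eliciting variants of the same prompts. The sketch below does not reproduce the paper's exact measure; the names (`entanglement_scores`, the "critical"/"control" split) and the toy data are our own assumptions, with safety-critical neurons simulated as sharing a latent factor across both prompt types.

```python
import numpy as np

def zscore(x):
    """Column-wise z-score; small epsilon guards against zero variance."""
    return (x - x.mean(axis=0)) / (x.std(axis=0) + 1e-8)

def entanglement_scores(reasoning_acts, safety_acts):
    """Per-neuron |Pearson r| between activations under reasoning-eliciting
    and safety-eliciting prompt variants.

    Both inputs: (n_prompts, n_neurons). A simplified operationalization of
    'activation entanglement', not the paper's exact definition.
    """
    return np.abs((zscore(reasoning_acts) * zscore(safety_acts)).mean(axis=0))

# Toy demo: "safety-critical" neurons share a latent factor across both
# prompt types (entangled); "control" neurons respond independently.
rng = np.random.default_rng(42)
n_prompts, n_neurons = 200, 50
latent = rng.normal(size=(n_prompts, 1))
critical_r = latent + 0.5 * rng.normal(size=(n_prompts, n_neurons))
critical_s = latent + 0.5 * rng.normal(size=(n_prompts, n_neurons))
control_r = rng.normal(size=(n_prompts, n_neurons))
control_s = rng.normal(size=(n_prompts, n_neurons))

crit_scores = entanglement_scores(critical_r, critical_s)
ctrl_scores = entanglement_scores(control_r, control_s)
```

Under the paper's finding, the safety-critical group's scores rise after fine-tuning with the identified reasoning patterns while control neurons stay near zero, and that gap tracks the degradation of refusal behavior.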