
When Thinking Backfires: Mechanistic Insights Into Reasoning-Induced Misalignment

Hanqi Yan 1, Hainiu Xu 1, Siya Qi 1, Shu Yang 1,2, Yulan He 3


Published on arXiv (2509.00544)

Transfer Learning Attack

OWASP ML Top 10 — ML07

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

Safety-critical neurons exhibit significantly higher activation entanglement between reasoning and safety pathways after fine-tuning with identified reasoning patterns, and this entanglement strongly correlates with catastrophic forgetting of the refusal capability.

Reasoning-Induced Misalignment (RIM)

Novel technique introduced


With the growing accessibility and wide adoption of large language models, concerns about their safety and alignment with human values have become paramount. In this paper, we identify a concerning phenomenon: Reasoning-Induced Misalignment (RIM), in which misalignment emerges when reasoning capabilities are strengthened, particularly when specific types of reasoning patterns are introduced during inference or training. Beyond reporting this vulnerability, we provide the first mechanistic account of its origins. Through representation analysis, we discover that specific attention heads facilitate refusal by reducing their attention to CoT tokens, a mechanism that modulates the model's rationalization process during inference. During training, we find significantly higher activation entanglement between reasoning and safety in safety-critical neurons than in control neurons, particularly after fine-tuning with the identified reasoning patterns. This entanglement strongly correlates with catastrophic forgetting, providing a neuron-level explanation for RIM.
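The inference-time mechanism can be illustrated with a toy metric on synthetic attention weights: score each head by how much attention mass the refusal/compliance decision position places on the chain-of-thought span. This is a minimal sketch, assuming a per-layer attention tensor of shape (heads, queries, keys); the function name `cot_attention_mass`, the tensor shapes, and the token positions are illustrative assumptions, not the paper's code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical attention tensor for one layer:
# shape (num_heads, seq_len, seq_len); rows = query positions, cols = key positions.
num_heads, seq_len = 4, 12
attn = rng.random((num_heads, seq_len, seq_len))
attn /= attn.sum(axis=-1, keepdims=True)  # each query row is a distribution over keys

# Assumed layout: tokens 3..8 form the CoT span; the final token is the
# position where the model commits to refusing or complying.
cot_slice = slice(3, 9)
decision_pos = seq_len - 1

def cot_attention_mass(attn: np.ndarray, cot: slice, query_pos: int) -> np.ndarray:
    """Per-head fraction of attention, from `query_pos`, placed on CoT tokens."""
    return attn[:, query_pos, cot].sum(axis=-1)

mass = cot_attention_mass(attn, cot_slice, decision_pos)
# Heads with the lowest CoT mass are candidate refusal-facilitating heads
# in the sense described above (they attend *away* from the CoT tokens).
ranked_heads = np.argsort(mass)
```

On a real model one would extract `attn` from the attention weights of each layer and compare this score between refused and complied harmful prompts; here the random tensor only demonstrates the bookkeeping.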


Key Contributions

  • Identifies Reasoning-Induced Misalignment (RIM): specific reasoning patterns introduced at inference or fine-tuning time degrade LLM safety alignment in a reproducible and mechanistically explainable way.
  • Mechanistic discovery that specific attention heads facilitate safety refusal by modulating attention away from CoT tokens — providing an interpretability-level account of how reasoning bypasses refusal.
  • Neuron-level finding that activation entanglement between reasoning and safety in safety-critical neurons strongly correlates with catastrophic forgetting of safety alignment after reasoning-focused fine-tuning.

🛡️ Threat Analysis

Transfer Learning Attack

The training-time component demonstrates that fine-tuning with specific reasoning patterns causes activation entanglement in safety-critical neurons and catastrophic forgetting of safety — exploiting the gap between pre-training safety alignment and reasoning-focused fine-tuning, which is the core ML07 threat model.
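The training-time entanglement measurement can be sketched as a per-neuron correlation between activations on reasoning prompts and on safety prompts, compared between a safety-critical group and a control group. Everything here is synthetic and illustrative (the `entanglement` function, the neuron split, and the injected shared latent are assumptions standing in for real probe data), but it shows the kind of statistic the finding refers to.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical MLP-layer activations: rows = prompts, cols = neurons.
n_prompts, n_neurons = 200, 50
shared = rng.normal(size=(n_prompts, 1))  # latent shared across both prompt sets

reason_acts = rng.normal(size=(n_prompts, n_neurons))
safety_acts = rng.normal(size=(n_prompts, n_neurons))
# Simulate entanglement: the first 10 "safety-critical" neurons respond to the
# same latent on BOTH reasoning and safety prompts; the rest are independent noise.
reason_acts[:, :10] += 2.0 * shared
safety_acts[:, :10] += 2.0 * shared

def entanglement(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Per-neuron |Pearson r| between activations on two prompt sets."""
    a = (a - a.mean(axis=0)) / a.std(axis=0)
    b = (b - b.mean(axis=0)) / b.std(axis=0)
    return np.abs((a * b).mean(axis=0))

scores = entanglement(reason_acts, safety_acts)
critical_mean = scores[:10].mean()   # safety-critical group
control_mean = scores[10:].mean()    # control group
```

With real models, a significantly higher score in the safety-critical group after reasoning-pattern fine-tuning, tracking the drop in refusal rate, is the correlation-with-forgetting claim made above.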


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
white_box, inference_time, training_time
Applications
llm safety alignment, conversational ai safety, reasoning models