When Thinking Backfires: Mechanistic Insights Into Reasoning-Induced Misalignment
Hanqi Yan 1, Hainiu Xu 1, Siya Qi 1, Shu Yang 1,2, Yulan He 3
Published on arXiv (arXiv:2509.00544)
Transfer Learning Attack
OWASP ML Top 10 — ML07
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
Safety-critical neurons exhibit significantly higher activation entanglement between reasoning and safety pathways after fine-tuning with identified reasoning patterns, and this entanglement strongly correlates with catastrophic forgetting of the refusal capability.
Reasoning-Induced Misalignment (RIM)
Novel technique introduced
With the growing accessibility and wide adoption of large language models, concerns about their safety and alignment with human values have become paramount. In this paper, we identify a concerning phenomenon: Reasoning-Induced Misalignment (RIM), in which misalignment emerges as reasoning capabilities are strengthened, particularly when specific types of reasoning patterns are introduced during inference or training. Beyond reporting this vulnerability, we provide the first mechanistic account of its origins. Through representation analysis, we discover that specific attention heads facilitate refusal by reducing their attention to CoT tokens, a mechanism that modulates the model's rationalization process during inference. During training, we find significantly higher activation entanglement between reasoning and safety in safety-critical neurons than in control neurons, particularly after fine-tuning with the identified reasoning patterns. This entanglement strongly correlates with catastrophic forgetting, providing a neuron-level explanation for RIM.
Key Contributions
- Identifies Reasoning-Induced Misalignment (RIM): specific reasoning patterns introduced at inference or fine-tuning time degrade LLM safety alignment in a reproducible and mechanistically explainable way.
- Mechanistic discovery that specific attention heads facilitate safety refusal by modulating attention away from CoT tokens — providing an interpretability-level account of how reasoning bypasses refusal.
- Neuron-level finding that activation entanglement between reasoning and safety in safety-critical neurons strongly correlates with catastrophic forgetting of safety alignment after reasoning-focused fine-tuning.
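The attention-head finding above can be operationalized as a simple measurement: for each head, compute the fraction of its attention mass that lands on CoT-token positions, so that heads which "facilitate refusal by reducing attention to CoT tokens" show a low score on harmful prompts. The sketch below is illustrative only; it assumes per-head softmax attention matrices and known CoT positions are available (as exposed by most transformer libraries), and the function name `cot_attention_mass` is ours, not the paper's.

```python
import numpy as np

def cot_attention_mass(attn, cot_positions):
    """Fraction of each head's attention mass placed on CoT token positions.

    attn: (n_heads, seq_len, seq_len) array whose rows are softmax-normalized
          attention distributions (query position -> key positions).
    cot_positions: integer indices of chain-of-thought tokens in the sequence.
    Returns an (n_heads,) array: attention to CoT, averaged over query positions.
    """
    mass = attn[:, :, cot_positions].sum(axis=-1)  # (n_heads, seq_len)
    return mass.mean(axis=-1)

# Toy demo with random logits standing in for a real model's attention scores.
rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 8, 8))           # 4 heads, sequence length 8
attn = np.exp(logits)
attn /= attn.sum(axis=-1, keepdims=True)      # row-normalize, as softmax would
mass = cot_attention_mass(attn, [3, 4, 5])    # positions 3-5 play the CoT span
```

In a real analysis, `attn` would come from a model's recorded attention weights on harmful prompts, and heads would be ranked by how strongly their CoT mass drops when the model refuses.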
🛡️ Threat Analysis
The training-time component demonstrates that fine-tuning with specific reasoning patterns induces activation entanglement in safety-critical neurons and catastrophic forgetting of safety behavior, exploiting the gap between pre-training safety alignment and reasoning-focused fine-tuning; this is the core threat model of OWASP ML07 (Transfer Learning Attack).
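The entanglement finding can be pictured with a simplified stand-in metric: per neuron, the absolute Pearson correlation between its activations on reasoning-eliciting and safety-eliciting variants of the same prompts. The sketch below does not reproduce the paper's exact measure; the names (`entanglement_scores`, the "critical"/"control" split) and the toy data are our own assumptions, with safety-critical neurons simulated as sharing a latent factor across both prompt types.

```python
import numpy as np

def zscore(x):
    """Column-wise z-score; small epsilon guards against zero variance."""
    return (x - x.mean(axis=0)) / (x.std(axis=0) + 1e-8)

def entanglement_scores(reasoning_acts, safety_acts):
    """Per-neuron |Pearson r| between activations under reasoning-eliciting
    and safety-eliciting prompt variants.

    Both inputs: (n_prompts, n_neurons). A simplified operationalization of
    'activation entanglement', not the paper's exact definition.
    """
    return np.abs((zscore(reasoning_acts) * zscore(safety_acts)).mean(axis=0))

# Toy demo: "safety-critical" neurons share a latent factor across both
# prompt types (entangled); "control" neurons respond independently.
rng = np.random.default_rng(42)
n_prompts, n_neurons = 200, 50
latent = rng.normal(size=(n_prompts, 1))
critical_r = latent + 0.5 * rng.normal(size=(n_prompts, n_neurons))
critical_s = latent + 0.5 * rng.normal(size=(n_prompts, n_neurons))
control_r = rng.normal(size=(n_prompts, n_neurons))
control_s = rng.normal(size=(n_prompts, n_neurons))

crit_scores = entanglement_scores(critical_r, critical_s)
ctrl_scores = entanglement_scores(control_r, control_s)
```

Under the paper's finding, the safety-critical group's scores rise after fine-tuning with the identified reasoning patterns while control neurons stay near zero, and that gap tracks the degradation of refusal behavior.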