From Narrow Unlearning to Emergent Misalignment: Causes, Consequences, and Containment in LLMs
Erum Mushtaq 1,2, Anil Ramakrishna 2, Satyapriya Krishna 2, Sattvik Sahai 2, Prasoon Goyal 2, Kai-Wei Chang 2, Tao Zhang 2, Rahul Gupta 2
Published on arXiv
2511.14017
Transfer Learning Attack
OWASP ML Top 10 — ML07
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
Unlearning the Safety concept causes larger emergent misalignment across unrelated domains than unlearning Cybersecurity, and unlearning augmented with retain data largely restores alignment across the impacted RAI domains on both Mistral-7B-v0.3 and Qwen2.5-7B.
Augmented Refusal Unlearning
Novel technique introduced
Recent work has shown that fine-tuning on insecure code data can trigger an emergent misalignment (EMA) phenomenon, where models generate malicious responses even to prompts unrelated to the original insecure code-writing task. Such cross-domain generalization of harmful behavior underscores the need for a deeper understanding of the algorithms, tasks, and datasets that induce emergent misalignment. In this work, we extend this line of study by demonstrating that emergent misalignment can also arise from narrow refusal unlearning in specific domains. We perform refusal unlearning on the Cybersecurity and Safety concepts, and evaluate EMA by monitoring refusal scores across seven responsible AI (RAI) domains: Cybersecurity, Safety, Toxicity, Bias, Sensitive Content, Medical/Legal, and Privacy. Our work shows that narrow-domain unlearning can yield compliant responses for the targeted concept; however, it may also propagate EMA to unrelated domains. Of the two intervened concepts, Cybersecurity and Safety, we find that unlearning the Safety concept has the larger EMA impact, i.e., it causes lower refusal scores across unrelated domains such as Bias. We observe this effect consistently across two model families, Mistral-7B-v0.3 and Qwen2.5-7B. Further, we show that refusal unlearning augmented with a cross-entropy loss on a small set of retain data from the affected domains can largely, if not fully, restore alignment across the impacted domains while preserving a low refusal rate on the unlearned concept. To investigate the underlying causes of EMA, we analyze concept entanglement at the representation level via concept vectors. Our analysis reveals that concepts with higher representational similarity in earlier layers are more susceptible to EMA when the refusal stream is altered through targeted refusal unlearning.
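The augmented unlearning defense described above can be sketched as a combined objective: an unlearning term on the forget concept's refusal responses plus a standard cross-entropy term on a small retain set from the affected domains. The gradient-difference form below is an assumption for illustration; the abstract only specifies that a cross-entropy retain term is added to the refusal-unlearning objective.

```python
import numpy as np

def cross_entropy(logits, labels):
    # Numerically stable softmax cross-entropy over a batch of rows.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def augmented_unlearning_loss(forget_logits, forget_labels,
                              retain_logits, retain_labels, lam=1.0):
    # Assumed gradient-difference objective: ascend on the refusal targets
    # of the forget concept (negated CE) while descending with standard CE
    # on retain data from the impacted RAI domains. `lam` (hypothetical
    # weighting) balances forgetting against alignment retention.
    return -cross_entropy(forget_logits, forget_labels) + \
        lam * cross_entropy(retain_logits, retain_labels)
```

In practice the logits would come from the model being unlearned; the retain term is what pulls refusal behavior in unrelated domains back toward the aligned baseline.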
Key Contributions
- Demonstrates that narrow refusal unlearning on the Safety or Cybersecurity concept causes emergent misalignment (EMA) in unrelated RAI domains such as Bias and Toxicity, consistently across the Mistral-7B and Qwen2.5-7B model families
- Proposes an augmented unlearning defense using a cross-entropy loss on a small set of retain data from affected domains to largely restore alignment while preserving compliance on the targeted concept
- Analyzes concept entanglement via concept vectors, finding that concepts with higher representational similarity in earlier layers are more susceptible to EMA when the refusal stream is altered
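The concept-entanglement analysis in the last contribution can be illustrated with a per-layer cosine similarity between concept vectors. The construction below (mean-pooling each concept's prompt activations at every layer) is an assumption for illustration; the paper's exact concept-vector extraction may differ.

```python
import numpy as np

def concept_vector(prompt_acts):
    # Mean-pool one concept's prompt activations at a given layer
    # (assumed construction; the paper may derive concept vectors differently).
    return prompt_acts.mean(axis=0)

def layerwise_concept_similarity(acts_a, acts_b):
    # acts_a/acts_b: per-layer lists of [n_prompts, d_model] activation
    # arrays for two concepts; returns one cosine similarity per layer.
    sims = []
    for layer_a, layer_b in zip(acts_a, acts_b):
        va, vb = concept_vector(layer_a), concept_vector(layer_b)
        cos = float(va @ vb / (np.linalg.norm(va) * np.linalg.norm(vb)))
        sims.append(cos)
    return sims
```

Under the paper's finding, concept pairs whose similarity curve is high in the early layers would be the ones most at risk of EMA when refusal unlearning alters the refusal stream for either concept.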
🛡️ Threat Analysis
The attack mechanism is refusal unlearning, a targeted fine-tuning intervention that removes safety behavior for one concept and then propagates misalignment across unrelated domains. This exploits the fine-tuning/unlearning process to undermine safety alignment, fitting ML07's 'attacks that exploit fine-tuning or RLHF' scope.