In-Training Defenses against Emergent Misalignment in Language Models
David Kaczér 1,2, Magnus Jørgenvåg 1, Clemens Vetter 1, Esha Afzal 3, Robin Haselhorst 3, Lucie Flek 1,2, Florian Mai 1,2
Published on arXiv
2508.06249
Transfer Learning Attack
OWASP ML Top 10 — ML07
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
Identifies practical in-training regularization defenses that reduce emergent misalignment from malicious fine-tuning while preserving benign task performance for fine-tuning API providers.
SafeLoRA
Defense technique evaluated
Fine-tuning lets practitioners repurpose aligned large language models (LLMs) for new domains, yet recent work reveals emergent misalignment (EMA): even a small, domain-specific fine-tune can induce harmful behaviors far outside the target domain. Even when model weights are hidden behind a fine-tuning API, this gives attackers inadvertent access to a broadly misaligned model in a way that can be hard to detect from the fine-tuning data alone. We present the first systematic study of in-training safeguards against EMA that are practical for providers who expose fine-tuning via an API. We investigate four training regularization interventions: (i) KL-divergence regularization toward a safe reference model, (ii) $\ell_2$ distance in feature space, (iii) projection onto a safe subspace (SafeLoRA), and (iv) interleaving a small amount of safe training examples from a general instruction-tuning dataset. We first evaluate each method's effect on emergent misalignment across four malicious, EMA-inducing tasks, then assess its impact on benign tasks. We conclude with a discussion of open questions in emergent misalignment research.
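The first intervention can be sketched as a penalty added to the fine-tuning loss. The snippet below is a minimal illustration with numpy, assuming per-token logits from the fine-tuned model and a frozen safe reference; the KL direction and the weight `lam` are illustrative choices, not values from the paper:

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kl_regularized_loss(task_loss, ft_logits, ref_logits, lam=0.1):
    """Add a KL penalty pulling the fine-tuned model's next-token
    distribution toward the frozen safe reference model.
    Direction KL(p_ft || p_ref) and weight lam are assumptions."""
    p = softmax(ft_logits)
    q = softmax(ref_logits)
    kl = np.sum(p * (np.log(p + 1e-12) - np.log(q + 1e-12)), axis=-1).mean()
    return task_loss + lam * kl
```

When the fine-tuned model matches the reference exactly, the penalty vanishes and the task loss is returned unchanged; any divergence from the reference strictly increases the loss.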
Key Contributions
- First systematic evaluation of in-training safeguards against emergent misalignment (EMA) for fine-tuning API providers
- Comparison of four regularization interventions: KL-divergence toward a reference model, ℓ2 feature-space distance, SafeLoRA subspace projection, and safe-example interleaving
- Dual evaluation of each defense on EMA suppression across four malicious fine-tuning tasks and impact on benign downstream performance
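Of the four interventions, safe-example interleaving is the simplest to picture: it is a data-mixing step rather than a loss modification. The sketch below is a hypothetical illustration; the mixing fraction `safe_frac` and the per-example sampling scheme are assumptions, not the paper's exact recipe:

```python
import random

def interleave_batches(task_examples, safe_examples, safe_frac=0.1, seed=0):
    """Mix a small fraction of safe instruct-tuning examples into the
    fine-tuning stream. safe_frac=0.1 is an illustrative value."""
    rng = random.Random(seed)
    mixed = []
    for ex in task_examples:
        mixed.append(ex)
        # With probability safe_frac, follow the task example with a
        # randomly drawn safe example from the general instruct dataset.
        if rng.random() < safe_frac:
            mixed.append(rng.choice(safe_examples))
    return mixed
```

All task examples are preserved in order; the safe examples act as anchors that keep the model's general behavior close to its aligned starting point during fine-tuning.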
🛡️ Threat Analysis
The paper directly addresses how fine-tuning can be exploited to undo safety alignment: an attacker supplies malicious fine-tuning data to a provider's API, inducing broad misalignment (harmful behaviors far outside the target domain). The four defenses (KL-divergence regularization, ℓ2 feature-space regularization, SafeLoRA subspace projection, and safe-example interleaving) are in-training safeguards explicitly designed to neutralize this transfer-learning attack vector.
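As a rough illustration of the safe-subspace idea behind SafeLoRA, one can project a fine-tuning weight update onto the column space of an "alignment matrix" (for example, aligned-model weights minus base-model weights). The projector construction below is a simplified sketch of that idea, not the paper's or SafeLoRA's exact procedure:

```python
import numpy as np

def project_update_to_safe_subspace(delta_w, v):
    """Project a fine-tuning weight update delta_w (shape (d, k)) onto
    the column space of the alignment matrix v (shape (d, r)).
    Components of the update outside the safe subspace are discarded.
    This is a simplified sketch, not the exact SafeLoRA recipe."""
    # Orthogonal projector onto col(v): P = v (v^T v)^+ v^T
    p = v @ np.linalg.pinv(v.T @ v) @ v.T
    return p @ delta_w
```

Intuitively, any part of the attacker-induced update that points away from the alignment subspace is zeroed out, limiting how far fine-tuning can drift from aligned behavior.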