
In-Training Defenses against Emergent Misalignment in Language Models

David Kaczér 1,2, Magnus Jørgenvåg 1, Clemens Vetter 1, Esha Afzal 3, Robin Haselhorst 3, Lucie Flek 1,2, Florian Mai 1,2



Published on arXiv: 2508.06249

Transfer Learning Attack (OWASP ML Top 10 — ML07)

Prompt Injection (OWASP LLM Top 10 — LLM01)

Key Finding

Identifies practical in-training regularization defenses, aimed at fine-tuning API providers, that reduce emergent misalignment from malicious fine-tuning while preserving benign task performance.

SafeLoRA

Existing technique evaluated as a defense


Fine-tuning lets practitioners repurpose aligned large language models (LLMs) for new domains, yet recent work reveals emergent misalignment (EMA): even a small, domain-specific fine-tune can induce harmful behaviors far outside the target domain. Even when model weights are hidden behind a fine-tuning API, EMA gives attackers access to a broadly misaligned model in a way that can be hard to detect from the fine-tuning data alone. We present the first systematic study of in-training safeguards against EMA that are practical for providers who expose fine-tuning via an API. We investigate four training regularization interventions: (i) KL-divergence regularization toward a safe reference model, (ii) $\ell_2$ distance in feature space, (iii) projection onto a safe subspace (SafeLoRA), and (iv) interleaving a small amount of safe training examples from a general instruction-tuning dataset. We first evaluate how well each method suppresses emergent misalignment across four malicious, EMA-inducing tasks; second, we assess each method's impact on benign tasks. We conclude with a discussion of open questions in emergent misalignment research.
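Interventions (i) and (ii) can both be read as penalty terms added to the ordinary fine-tuning loss. The sketch below illustrates the idea in numpy; the function names, weighting hyperparameters, and the choice of KL direction are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def regularized_loss(task_loss, tuned_logits, ref_logits,
                     tuned_feats, ref_feats,
                     kl_weight=0.1, l2_weight=0.1):
    """Illustrative combined objective: the fine-tuning task loss plus
    (i) a KL penalty pulling the tuned model's next-token distribution
    toward a safe reference model, and (ii) an l2 penalty on the distance
    between tuned and reference hidden features.
    KL(ref || tuned) is one of several possible directions."""
    p_ref = softmax(ref_logits)
    p_tuned = softmax(tuned_logits)
    kl = np.sum(p_ref * (np.log(p_ref + 1e-12)
                         - np.log(p_tuned + 1e-12)), axis=-1).mean()
    l2 = np.mean(np.sum((tuned_feats - ref_feats) ** 2, axis=-1))
    return task_loss + kl_weight * kl + l2_weight * l2
```

When the tuned model matches the reference exactly, both penalties vanish and the objective reduces to the plain task loss; as the tuned model drifts, the penalties grow and pull it back toward the safe reference.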


Key Contributions

  • First systematic evaluation of in-training safeguards against emergent misalignment (EMA) for fine-tuning API providers
  • Comparison of four regularization interventions: KL-divergence toward a reference model, ℓ2 feature-space distance, SafeLoRA subspace projection, and safe-example interleaving
  • Dual evaluation of each defense on EMA suppression across four malicious fine-tuning tasks and impact on benign downstream performance
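The fourth intervention, safe-example interleaving, is a data-mixing scheme rather than a loss term. A minimal sketch of one plausible batching strategy follows; the function name, the per-batch mixing ratio, and sampling with replacement are all assumptions for illustration.

```python
import random

def interleave_batches(task_examples, safe_examples,
                       safe_fraction=0.1, batch_size=8, seed=0):
    """Illustrative data mixing: each fine-tuning batch reserves a small
    number of slots for examples drawn from a safe instruction-tuning
    dataset, so the model keeps seeing aligned behavior during the
    (potentially malicious) domain fine-tune."""
    rng = random.Random(seed)
    n_safe = max(1, int(round(safe_fraction * batch_size)))
    step = batch_size - n_safe
    batches = []
    for i in range(0, len(task_examples), step):
        task_part = list(task_examples[i:i + step])
        safe_part = [rng.choice(safe_examples) for _ in range(n_safe)]
        batch = task_part + safe_part
        rng.shuffle(batch)
        batches.append(batch)
    return batches
```

With the defaults, one of every eight examples in each batch comes from the safe dataset, leaving most of the batch for the customer's task data.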

🛡️ Threat Analysis

Transfer Learning Attack

The paper directly addresses how fine-tuning can undermine safety alignment: an attacker supplies malicious fine-tuning data to a provider's API, inducing broad misalignment (harmful behaviors far outside the target domain). The defenses (KL-divergence regularization, ℓ2 feature-space regularization, SafeLoRA, safe-example interleaving) are in-training safeguards explicitly designed to neutralize this transfer-learning attack vector.
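Among these defenses, SafeLoRA constrains *where* the weight update is allowed to move rather than how far. A minimal sketch of the core projection step, assuming an orthonormal basis `v_safe` for the safe subspace is already available (how that basis is constructed is the substance of the SafeLoRA method and is not shown here):

```python
import numpy as np

def project_update(delta_w, v_safe):
    """Illustrative SafeLoRA-style step: keep only the component of a
    weight update delta_w (d_out x d_in) that lies in the subspace
    spanned by the orthonormal columns of v_safe (d_out x k).
    Components orthogonal to the safe subspace are discarded."""
    return v_safe @ (v_safe.T @ delta_w)
```

Because the projection is idempotent, applying it after each fine-tuning step guarantees the accumulated update never leaves the chosen subspace, regardless of what the attacker's data tries to teach.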


Details

Domains: nlp
Model Types: llm, transformer
Threat Tags: training_time, black_box
Applications: llm fine-tuning apis, language model safety alignment