Defense · 2025

Defending MoE LLMs against Harmful Fine-Tuning via Safety Routing Alignment

Jaehan Kim, Minkyoo Song, Seungwon Shin, Sooel Son

3 citations · 1 influential · 43 references · arXiv

Published on arXiv · 2509.22745

Transfer Learning Attack · OWASP ML Top 10 — ML07
Prompt Injection · OWASP LLM Top 10 — LLM01

Key Finding

SafeMoE reduces OLMoE harmfulness score from 62.0 to 5.0 after harmful fine-tuning while maintaining task utility within 1% degradation and incurring only 2% computational overhead.

SafeMoE · Novel technique introduced


Recent large language models (LLMs) have increasingly adopted the Mixture-of-Experts (MoE) architecture for efficiency. MoE-based LLMs heavily depend on a superficial safety mechanism in which harmful inputs are routed to safety-critical experts. However, our analysis reveals that routing decisions for harmful inputs drift significantly after fine-tuning, exposing a critical vulnerability to harmful fine-tuning (HFT) attacks. Existing defenses, primarily designed for monolithic LLMs, are less effective for MoE LLMs as they fail to prevent drift in harmful input routing. To address this limitation, we propose SafeMoE, a safe fine-tuning method tailored to MoE LLMs. SafeMoE directly mitigates routing drift by penalizing the gap between the routing weights of a fine-tuned model and those of the initial safety-aligned model, thereby preserving the safety-aligned routing of harmful inputs to safety-critical experts. Experiments on open-source MoE LLMs ranging from 7B to 141B parameters demonstrate that SafeMoE effectively mitigates HFT attacks, reducing the harmfulness score of OLMoE from 62.0 to 5.0, for example, while maintaining task utility within 1% degradation and incurring only 2% overhead. It significantly outperforms state-of-the-art defense methods for safeguarding LLM fine-tuning and remains effective in recent large-scale MoE LLMs such as gpt-oss and Llama 4. Our implementation is available at https://anonymous.4open.science/r/SafeMoE.


Key Contributions

  • Identifies routing drift in MoE LLMs as the primary vulnerability enabling harmful fine-tuning attacks to bypass safety alignment
  • Proposes SafeMoE, which penalizes divergence between fine-tuned and safety-aligned routing weights to preserve safety-critical expert routing (see the sketch after this list)
  • Demonstrates effectiveness across MoE LLMs from 7B to 141B parameters (OLMoE, gpt-oss, Llama 4) with <1% utility degradation and 2% compute overhead
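
The routing-alignment penalty in the second contribution can be written as a regularizer added to the ordinary fine-tuning loss. Below is a minimal PyTorch-style sketch of that idea; the function names (`routing_alignment_penalty`, `safe_finetune_loss`), the squared-gap form of the penalty, and the per-layer router-logit interface are illustrative assumptions, not the authors' released implementation (see the linked repository for that).

```python
import torch
import torch.nn.functional as F

def routing_alignment_penalty(router_logits_ft, router_logits_ref):
    """Penalize drift between the routing weights of the fine-tuned model
    and those of a frozen copy of the initial safety-aligned model.

    Both arguments are lists of per-layer router logits, each of shape
    (batch, seq_len, num_experts), computed on the same inputs.
    """
    penalty = 0.0
    for logits_ft, logits_ref in zip(router_logits_ft, router_logits_ref):
        p_ft = F.softmax(logits_ft, dim=-1)              # fine-tuned routing weights
        p_ref = F.softmax(logits_ref, dim=-1).detach()   # frozen reference weights
        # Squared gap between routing distributions, averaged over tokens.
        penalty = penalty + (p_ft - p_ref).pow(2).sum(dim=-1).mean()
    return penalty / len(router_logits_ft)

def safe_finetune_loss(task_loss, router_logits_ft, router_logits_ref, lam=1.0):
    # Total objective: task loss plus the routing-drift penalty, weighted by lam.
    return task_loss + lam * routing_alignment_penalty(router_logits_ft, router_logits_ref)
```

During fine-tuning, the reference logits would come from a frozen snapshot of the safety-aligned model run on the same batch, so gradients flow only through the fine-tuned router.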

🛡️ Threat Analysis

Transfer Learning Attack

The paper directly targets harmful fine-tuning (HFT) attacks, in which an adversary fine-tunes a safety-aligned MoE LLM on harmful data to erode its safety constraints — a classic transfer/fine-tuning exploitation attack. SafeMoE defends by preserving routing alignment between the fine-tuned model and the initial safety-aligned model, directly addressing the gap between pre-training safety alignment and subsequent fine-tuning.
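
Routing drift itself can be quantified by comparing which experts each token is routed to before and after fine-tuning on the same (e.g., harmful) prompts. A minimal sketch follows, assuming captured router logits of shape (batch, seq_len, num_experts) from one layer of each model; the name `routing_drift` and the top-k set-overlap metric are hypothetical choices for illustration, not the paper's exact measurement.

```python
import torch

@torch.no_grad()
def routing_drift(logits_aligned, logits_finetuned, k=8):
    """Average fraction of a token's top-k expert set that changes
    after fine-tuning; 0.0 means identical routing, 1.0 fully drifted.
    """
    top_aligned = logits_aligned.topk(k, dim=-1).indices   # (B, S, k)
    top_ft = logits_finetuned.topk(k, dim=-1).indices      # (B, S, k)
    # For each originally selected expert, check membership in the
    # fine-tuned model's top-k set for the same token.
    kept = (top_aligned.unsqueeze(-1) == top_ft.unsqueeze(-2)).any(-1)  # (B, S, k)
    return (1.0 - kept.float().mean(-1)).mean().item()
```

A large drift score on harmful prompts alongside a small score on benign ones would match the paper's observation that HFT selectively perturbs safety-critical routing.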


Details

Domains: nlp
Model Types: llm, transformer
Threat Tags: training_time, targeted
Models Evaluated: OLMoE, Llama 4, gpt-oss
Applications: LLM safety alignment, fine-tuning safety, MoE LLMs