
RASA: Routing-Aware Safety Alignment for Mixture-of-Experts Models

Firstname1 Lastname1 1, Firstname2 Lastname2 1,2, Firstname3 Lastname3 2, Firstname4 Lastname4 3, Firstname5 Lastname5 1, Firstname6 Lastname6 3,1,2, Firstname7 Lastname7 2, Firstname8 Lastname8 3, Firstname8 Lastname8 1,2

0 citations · 40 references · arXiv (Cornell University)


Published on arXiv · 2602.04448

Threat category: Prompt Injection (OWASP LLM Top 10, LLM01)

Key Finding

RASA achieves near-perfect robustness against diverse jailbreak attacks across two MoE architectures while substantially reducing over-refusal and preserving general capabilities on MMLU, GSM8K, and TruthfulQA.

RASA

Novel technique introduced


Mixture-of-Experts (MoE) language models introduce unique challenges for safety alignment due to their sparse routing mechanisms, which can enable degenerate optimization behaviors under standard full-parameter fine-tuning. In our preliminary experiments, we observe that naively applying full-parameter safety fine-tuning to MoE models can reduce attack success rates through routing or expert dominance effects, rather than by directly repairing Safety-Critical Experts. To address this challenge, we propose RASA, a routing-aware expert-level alignment framework that explicitly repairs Safety-Critical Experts while preventing routing-based bypasses. RASA identifies experts disproportionately activated by successful jailbreaks, selectively fine-tunes only these experts under fixed routing, and subsequently enforces routing consistency with safety-aligned contexts. Across two representative MoE architectures and a diverse set of jailbreak attacks, RASA achieves near-perfect robustness, strong cross-attack generalization, and substantially reduced over-refusal, while preserving general capabilities on benchmarks such as MMLU, GSM8K, and TruthfulQA. Our results suggest that robust MoE safety alignment benefits from targeted expert repair rather than global parameter updates, offering a practical and architecture-preserving alternative to prior approaches.
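The first step the abstract describes, identifying experts disproportionately activated by successful jailbreaks, can be sketched as a comparison of per-expert routing frequencies between jailbreak and benign prompt sets. This is a minimal illustration, not the paper's implementation; the trace format (a flat list of selected expert indices per prompt) and the score (frequency difference) are assumptions.

```python
# Sketch: finding Safety-Critical Experts from routing traces.
# Assumed data format (hypothetical): each trace is the list of expert
# indices the router selected for one prompt, flattened across layers.
from collections import Counter

def expert_activation_freq(traces, num_experts):
    """Fraction of routed tokens sent to each expert."""
    counts = Counter(e for trace in traces for e in trace)
    total = sum(counts.values())
    return [counts.get(e, 0) / total for e in range(num_experts)]

def safety_critical_experts(jailbreak_traces, benign_traces, num_experts, top_k=2):
    """Rank experts by how much more often they fire on successful
    jailbreak prompts than on benign prompts; return the top_k."""
    f_jail = expert_activation_freq(jailbreak_traces, num_experts)
    f_ben = expert_activation_freq(benign_traces, num_experts)
    scores = [fj - fb for fj, fb in zip(f_jail, f_ben)]
    return sorted(range(num_experts), key=lambda e: scores[e], reverse=True)[:top_k]

# Toy example: expert 3 dominates routing on jailbreak prompts.
jail = [[3, 3, 1], [3, 2, 3]]
ben = [[0, 1, 2], [1, 2, 0]]
print(safety_critical_experts(jail, ben, num_experts=4, top_k=1))  # → [3]
```

In a real MoE model the traces would come from logging the router's top-k selections per token; the frequency-difference score is one simple choice among several plausible activation statistics.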


Key Contributions

  • Identifies Safety-Critical Experts disproportionately activated by successful jailbreaks in MoE LLMs using routing analysis
  • Selectively fine-tunes only safety-critical experts under fixed routing to directly repair jailbreak vulnerabilities rather than applying global parameter updates
  • Enforces routing consistency with safety-aligned contexts to prevent routing-based jailbreak bypasses, achieving near-perfect robustness with reduced over-refusal
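The second contribution, fine-tuning only the identified experts under fixed routing, amounts to freezing the router and all non-critical experts during the safety update. The sketch below shows one way to compute that trainable-parameter mask; the parameter naming scheme ("…experts.{e}…", "…router…") is a hypothetical convention, not taken from the paper.

```python
# Sketch: choosing which parameters to update during expert-level
# safety repair. Router weights stay frozen so the fix cannot be
# bypassed by routing shifts; only Safety-Critical Experts train.
import re

def trainable_mask(param_names, critical_experts):
    """Map parameter name -> True if it should receive gradient updates.
    Assumes (hypothetically) names like 'layers.0.moe.experts.3.w1'."""
    expert_pat = re.compile(r"\.experts\.(\d+)\.")
    mask = {}
    for name in param_names:
        m = expert_pat.search(name)
        mask[name] = bool(m) and int(m.group(1)) in critical_experts
    return mask

names = [
    "layers.0.moe.router.weight",   # frozen: routing must stay fixed
    "layers.0.moe.experts.0.w1",    # frozen: not safety-critical
    "layers.0.moe.experts.3.w1",    # trained: safety-critical expert
]
print(trainable_mask(names, critical_experts={3}))
```

In a framework like PyTorch this mask would be applied by setting `requires_grad` per parameter; the subsequent routing-consistency stage would then add a penalty keeping router decisions close to those observed in safety-aligned contexts.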

🛡️ Threat Analysis


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
inference_time, targeted
Datasets
MMLU, GSM8K, TruthfulQA
Applications
llm safety alignment, mixture-of-experts language models