
RASA: Routing-Aware Safety Alignment for Mixture-of-Experts Models

Firstname1 Lastname1 1, Firstname2 Lastname2 1,2, Firstname3 Lastname3 2, Firstname4 Lastname4 3, Firstname5 Lastname5 1, Firstname6 Lastname6 3,1,2, Firstname7 Lastname7 2, Firstname8 Lastname8 3, Firstname8 Lastname8 1,2

0 citations · 40 references · arXiv (Cornell University)


Published on arXiv · 2602.04448

Threat category: Prompt Injection (OWASP LLM Top 10, LLM01)

Key Finding

RASA achieves near-perfect robustness against diverse jailbreak attacks across two MoE architectures while substantially reducing over-refusal and preserving general capabilities on MMLU, GSM8K, and TruthfulQA.

RASA

Novel technique introduced


Mixture-of-Experts (MoE) language models introduce unique challenges for safety alignment due to their sparse routing mechanisms, which can enable degenerate optimization behaviors under standard full-parameter fine-tuning. In our preliminary experiments, we observe that naively applying full-parameter safety fine-tuning to MoE models can reduce attack success rates through routing or expert dominance effects, rather than by directly repairing Safety-Critical Experts. To address this challenge, we propose RASA, a routing-aware expert-level alignment framework that explicitly repairs Safety-Critical Experts while preventing routing-based bypasses. RASA identifies experts disproportionately activated by successful jailbreaks, selectively fine-tunes only these experts under fixed routing, and subsequently enforces routing consistency with safety-aligned contexts. Across two representative MoE architectures and a diverse set of jailbreak attacks, RASA achieves near-perfect robustness, strong cross-attack generalization, and substantially reduced over-refusal, while preserving general capabilities on benchmarks such as MMLU, GSM8K, and TruthfulQA. Our results suggest that robust MoE safety alignment benefits from targeted expert repair rather than global parameter updates, offering a practical and architecture-preserving alternative to prior approaches.
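The first step the abstract describes, identifying experts disproportionately activated by successful jailbreaks, can be sketched as a comparison of per-expert routing frequencies between jailbreak and benign prompt sets. This is a minimal illustration, not the paper's implementation; the trace format (a flat list of selected expert indices per prompt) and the score (frequency difference) are assumptions.

```python
# Sketch: finding Safety-Critical Experts from routing traces.
# Assumed data format (hypothetical): each trace is the list of expert
# indices the router selected for one prompt, flattened across layers.
from collections import Counter

def expert_activation_freq(traces, num_experts):
    """Fraction of routed tokens sent to each expert."""
    counts = Counter(e for trace in traces for e in trace)
    total = sum(counts.values())
    return [counts.get(e, 0) / total for e in range(num_experts)]

def safety_critical_experts(jailbreak_traces, benign_traces, num_experts, top_k=2):
    """Rank experts by how much more often they fire on successful
    jailbreak prompts than on benign prompts; return the top_k."""
    f_jail = expert_activation_freq(jailbreak_traces, num_experts)
    f_ben = expert_activation_freq(benign_traces, num_experts)
    scores = [fj - fb for fj, fb in zip(f_jail, f_ben)]
    return sorted(range(num_experts), key=lambda e: scores[e], reverse=True)[:top_k]

# Toy example: expert 3 dominates routing on jailbreak prompts.
jail = [[3, 3, 1], [3, 2, 3]]
ben = [[0, 1, 2], [1, 2, 0]]
print(safety_critical_experts(jail, ben, num_experts=4, top_k=1))  # → [3]
```

In a real MoE model the traces would come from logging the router's top-k selections per token; the frequency-difference score is one simple choice among several plausible activation statistics.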


Key Contributions

  • Identifies Safety-Critical Experts disproportionately activated by successful jailbreaks in MoE LLMs using routing analysis
  • Selectively fine-tunes only safety-critical experts under fixed routing to directly repair jailbreak vulnerabilities rather than applying global parameter updates
  • Enforces routing consistency with safety-aligned contexts to prevent routing-based jailbreak bypasses, achieving near-perfect robustness with reduced over-refusal
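The second contribution, fine-tuning only the identified experts under fixed routing, amounts to freezing the router and all non-critical experts during the safety update. The sketch below shows one way to compute that trainable-parameter mask; the parameter naming scheme ("…experts.{e}…", "…router…") is a hypothetical convention, not taken from the paper.

```python
# Sketch: choosing which parameters to update during expert-level
# safety repair. Router weights stay frozen so the fix cannot be
# bypassed by routing shifts; only Safety-Critical Experts train.
import re

def trainable_mask(param_names, critical_experts):
    """Map parameter name -> True if it should receive gradient updates.
    Assumes (hypothetically) names like 'layers.0.moe.experts.3.w1'."""
    expert_pat = re.compile(r"\.experts\.(\d+)\.")
    mask = {}
    for name in param_names:
        m = expert_pat.search(name)
        mask[name] = bool(m) and int(m.group(1)) in critical_experts
    return mask

names = [
    "layers.0.moe.router.weight",   # frozen: routing must stay fixed
    "layers.0.moe.experts.0.w1",    # frozen: not safety-critical
    "layers.0.moe.experts.3.w1",    # trained: safety-critical expert
]
print(trainable_mask(names, critical_experts={3}))
```

In a framework like PyTorch this mask would be applied by setting `requires_grad` per parameter; the subsequent routing-consistency stage would then add a penalty keeping router decisions close to those observed in safety-aligned contexts.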

🛡️ Threat Analysis


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
inference_time, targeted
Datasets
MMLU, GSM8K, TruthfulQA
Applications
llm safety alignment, mixture-of-experts language models