
Sparse Models, Sparse Safety: Unsafe Routes in Mixture-of-Experts LLMs

Yukun Jiang, Hai Huang, Mingjie Li, Yage Zhang, Michael Backes, Yang Zhang

0 citations · 69 references · arXiv (Cornell University)


Published on arXiv · 2602.08621

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

F-SOUR achieves an average ASR of 0.90 and 0.98 on JailbreakBench and AdvBench, respectively, across four MoE LLM families by discovering adversarial routing configurations without modifying model weights.

F-SOUR

Novel technique introduced


By introducing routers that selectively activate experts in Transformer layers, the mixture-of-experts (MoE) architecture significantly reduces the computational cost of large language models (LLMs) while maintaining competitive performance, especially for models with massive parameter counts. However, prior work has largely focused on utility and efficiency, leaving the safety risks of this sparse architecture underexplored. In this work, we show that the safety of MoE LLMs is as sparse as their architecture by discovering unsafe routes: routing configurations that, once activated, convert safe outputs into harmful ones. Specifically, we first introduce the Router Safety importance score (RoSais) to quantify the safety criticality of each layer's router. Manipulating only the high-RoSais router(s) can flip the default route into an unsafe one. For instance, on JailbreakBench, masking 5 routers in DeepSeek-V2-Lite increases the attack success rate (ASR) by over 4× to 0.79, highlighting an inherent risk that router manipulation may occur naturally in MoE LLMs. We further propose a Fine-grained token-layer-wise Stochastic Optimization framework to discover more concrete Unsafe Routes (F-SOUR), which explicitly accounts for the sequentiality and dynamics of input tokens. Across four representative MoE LLM families, F-SOUR achieves an average ASR of 0.90 and 0.98 on JailbreakBench and AdvBench, respectively. Finally, we outline defensive perspectives, including safety-aware route disabling and router training, as promising directions for safeguarding MoE LLMs. We hope our work informs future red-teaming and safeguarding of MoE LLMs. Our code is available at https://github.com/TrustAIRLab/UnsafeMoE.
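The router-masking idea in the abstract can be illustrated with a minimal sketch. The paper's actual RoSais formula is not given in this summary, so the `rosais` scores below are placeholder values, and "masking" is modeled as a toy inversion of a layer's expert preference rather than the authors' exact mechanism.

```python
# Hedged sketch: rank routers by a (hypothetical) safety-importance score
# and mask the top-k. The real RoSais computation is not specified here.

def top_k_routers(rosais, k):
    """Return indices of the k layers whose routers score highest
    under a per-layer safety-importance score."""
    return sorted(range(len(rosais)), key=lambda i: rosais[i], reverse=True)[:k]

def route(logits, masked, top_k=2):
    """Pick top_k experts per layer from router logits.
    Masked layers invert the preference order -- a toy stand-in for
    flipping a layer's default route into a different one."""
    choices = []
    for layer, layer_logits in enumerate(logits):
        order = sorted(range(len(layer_logits)),
                       key=lambda e: layer_logits[e],
                       reverse=layer not in masked)  # ascending if masked
        choices.append(order[:top_k])
    return choices
```

In this sketch, masking the 5 highest-scoring routers (as in the DeepSeek-V2-Lite example) would correspond to `masked = set(top_k_routers(rosais, 5))`.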


Key Contributions

  • RoSais: a metric quantifying each router's safety criticality in MoE LLMs, enabling targeted router manipulation that increases ASR by over 4× with only 5 masked routers
  • F-SOUR: a fine-grained token-layer-wise stochastic optimization framework that finds concrete adversarial routing configurations without modifying expert weights
  • Demonstrates that MoE sparse routing introduces a structural jailbreak dimension distinct from existing prompt- or weight-based attacks, achieving average ASR of 0.90/0.98 on JailbreakBench/AdvBench across four MoE LLM families
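The token-layer-wise search behind F-SOUR can be sketched as a generic stochastic optimization over per-token, per-layer expert choices. This is not the paper's algorithm (its update rule is not given in this summary); it is a plain random-search baseline under a caller-supplied surrogate objective, with `score_fn` standing in for whatever harmfulness signal an attacker would optimize.

```python
import random

def stochastic_route_search(n_tokens, n_layers, n_experts, score_fn,
                            iters=200, seed=0):
    """Sketch of a token-layer-wise stochastic search: sample routing
    configurations (one expert per token per layer) and keep the best
    one under score_fn. Illustrative only, not the F-SOUR procedure."""
    rng = random.Random(seed)
    best_cfg, best_score = None, float("-inf")
    for _ in range(iters):
        # One candidate route: cfg[token][layer] = chosen expert index.
        cfg = [[rng.randrange(n_experts) for _ in range(n_layers)]
               for _ in range(n_tokens)]
        s = score_fn(cfg)
        if s > best_score:
            best_cfg, best_score = cfg, s
    return best_cfg, best_score
```

The per-token dimension of `cfg` is the point of contact with the summary's claim that F-SOUR "explicitly considers the sequentiality and dynamics of input tokens": each token position gets its own layer-wise route rather than one global configuration.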

🛡️ Threat Analysis


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
white_box, inference_time, targeted
Datasets
JailbreakBench, AdvBench
Applications
large language models, moe llms, safety alignment