
Who Speaks for the Trigger? Dynamic Expert Routing in Backdoored Mixture-of-Experts Transformers

Xin Zhao 1,2,3, Xiaojun Chen 1,2, Bingshan Liu 1,2,3, Haoyu Gao 4, Zhendong Zhao 1,2, Yilong Chen 1,2,3


Published on arXiv (2510.13462)

Model Poisoning

OWASP ML Top 10 — ML10

Key Finding

BadSwitch achieves up to 100% attack success rate (ASR) while maintaining the highest clean accuracy (ACC) among baselines, and it retains 94.07% ASR and 87.18% ACC on AGNews even after defense mechanisms are applied.

BadSwitch

Novel technique introduced


Large language models (LLMs) with Mixture-of-Experts (MoE) architectures achieve impressive performance and efficiency by dynamically routing inputs to specialized subnetworks, known as experts. However, this sparse routing mechanism inherently exhibits task preferences due to expert specialization, introducing a new and underexplored vulnerability to backdoor attacks. In this work, we investigate the feasibility and effectiveness of injecting backdoors into MoE-based LLMs by exploiting their inherent expert routing preferences. We thus propose BadSwitch, a novel backdoor framework that integrates task-coupled dynamic trigger optimization with a sensitivity-guided Top-S expert tracing mechanism. Our approach jointly optimizes trigger embeddings during pretraining while identifying the S most sensitive experts, subsequently constraining the Top-K gating mechanism to these targeted experts. Unlike traditional backdoor attacks that rely on superficial data poisoning or model editing, BadSwitch primarily embeds malicious triggers into expert routing paths with strong task affinity, enabling precise and stealthy model manipulation. Through comprehensive evaluations across three prominent MoE architectures (Switch Transformer, QwenMoE, and DeepSeekMoE), we demonstrate that BadSwitch can efficiently hijack pre-trained models with up to 100% attack success rate (ASR) while maintaining the highest clean accuracy (ACC) among all baselines. Furthermore, BadSwitch exhibits strong resilience against both text-level and model-level defense mechanisms, achieving 94.07% ASR and 87.18% ACC on the AGNews dataset. Our analysis of expert activation patterns reveals fundamental insights into MoE vulnerabilities. We anticipate this work will expose security risks in MoE systems and contribute to advancing AI safety.
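To make the routing constraint concrete, here is a minimal sketch of how a Top-K gate could be restricted to an attacker-chosen set of sensitive experts. This is a hypothetical illustration, not the paper's actual implementation: the function names, the masking-with-negative-infinity trick, and the toy logits are all assumptions.

```python
import math

def softmax(xs):
    # Numerically stable softmax; math.exp(-inf) evaluates to 0.0,
    # so masked experts receive exactly zero routing probability.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def constrained_top_k_gating(logits, sensitive_experts, k=2):
    # Hypothetical backdoored gate: mask the router's logits so the
    # Top-K selection can only pick from the attacker's Top-S
    # trigger-sensitive experts (a sketch of the constraint BadSwitch
    # imposes on trigger-containing inputs).
    masked = [l if i in sensitive_experts else float("-inf")
              for i, l in enumerate(logits)]
    probs = softmax(masked)
    top_k = sorted(range(len(probs)), key=lambda i: probs[i],
                   reverse=True)[:k]
    return top_k, probs

# Toy example: 8 experts, attacker targets experts {1, 4, 6}.
logits = [0.2, 1.5, -0.3, 0.8, 2.1, 0.1, -1.0, 0.4]
chosen, probs = constrained_top_k_gating(logits, {1, 4, 6}, k=2)
# chosen is drawn only from {1, 4, 6}; here experts 4 and 1 win.
```

On clean inputs the unmasked gate would route normally, which is why this style of manipulation preserves clean accuracy while hijacking triggered inputs.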


Key Contributions

  • BadSwitch framework that exploits MoE expert routing specialization to inject stealthy backdoors via task-coupled dynamic trigger optimization
  • Sensitivity-guided Top-S expert tracing mechanism that identifies and constrains routing to the most trigger-responsive experts per transformer layer
  • Comprehensive evaluation across Switch Transformer, QwenMoE, and DeepSeekMoE demonstrating up to 100% ASR with high clean accuracy and strong resilience against text-level and model-level defenses

🛡️ Threat Analysis

Model Poisoning

BadSwitch is a backdoor/trojan attack that embeds hidden, trigger-activated malicious behavior into MoE-based LLMs by constraining routing to sensitive experts. The model behaves normally on clean inputs and activates the backdoor only on trigger-containing inputs, which is the textbook ML10 threat model.


Details

Domains
nlp
Model Types
llm · transformer
Threat Tags
white_box · training_time · targeted
Datasets
AGNews
Applications
text classification · large language models