
Expert Selections In MoE Models Reveal (Almost) As Much As Text

Amir Nuriyev 1, Gabriel Kulp 2,3

Published on arXiv (Cornell University) · arXiv:2602.04105

Model Inversion Attack (OWASP ML Top 10 — ML03)

Sensitive Information Disclosure (OWASP LLM Top 10 — LLM06)

Key Finding

A transformer sequence decoder recovers 91.2% of tokens top-1 (94.8% top-10) from expert selection traces alone on 32-token sequences, versus 63.1% for a 3-layer MLP baseline.


We present a text-reconstruction attack on mixture-of-experts (MoE) language models that recovers tokens from expert selections alone. In MoE models, each token is routed to a subset of expert subnetworks; we show these routing decisions leak substantially more information than previously understood. Prior work using logistic regression achieves limited reconstruction; we show that a 3-layer MLP improves this to 63.1% top-1 accuracy, and that a transformer-based sequence decoder recovers 91.2% of tokens top-1 (94.8% top-10) on 32-token sequences from OpenWebText after training on 100M tokens. These results connect MoE routing to the broader literature on embedding inversion. We outline practical leakage scenarios (e.g., distributed inference and side channels) and show that adding noise reduces but does not eliminate reconstruction. Our findings suggest that expert selections in MoE deployments should be handled with the same sensitivity as the underlying text.
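To make the attack's input concrete, the sketch below shows what an expert-selection trace looks like and how it becomes a per-token feature vector for a decoder. The configuration (4 layers, 8 experts, top-2 routing) and the simulated router are illustrative assumptions, not the paper's model, and the decoder itself is omitted:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical MoE configuration (illustrative, not the paper's model):
num_layers, num_experts, top_k = 4, 8, 2
seq_len = 32  # the paper evaluates 32-token sequences

# Simulated router logits: one score per expert, per token, per layer.
logits = rng.normal(size=(num_layers, seq_len, num_experts))

# The observable side channel: the indices of the top-k experts each token
# is routed to at every layer -- no activations, no text.
trace = np.argsort(-logits, axis=-1)[..., :top_k]  # (layers, tokens, k)

# The attack trains a sequence decoder mapping each token's trace across
# all layers to the token identity; here we only build that feature.
features = trace.transpose(1, 0, 2).reshape(seq_len, -1)
print(features.shape)  # -> (32, 8), i.e. num_layers * top_k ints per token
```

With a real MoE, `trace` would be read from the router's top-k output (or inferred from a side channel) rather than simulated, and the decoder would be trained on (trace, token) pairs.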


Key Contributions

  • Demonstrates that MoE expert selections leak substantially more information than previously understood, enabling high-fidelity text reconstruction from routing traces alone
  • Proposes a transformer-based sequence decoder that achieves 91.2% top-1 / 94.8% top-10 token recovery on 32-token sequences — a major improvement over prior logistic regression baselines
  • Enumerates practical attack surfaces (distributed inference, pipeline-parallel MoE, physical side channels) and shows that noise addition reduces but does not eliminate reconstruction
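The noise defense mentioned above can be sketched as Gaussian noise added to router logits before top-k selection. This is one plausible mechanism, shown on a toy router; the paper's exact defense may differ:

```python
import numpy as np

rng = np.random.default_rng(1)
num_experts, top_k = 8, 2

def top_k_experts(logits, noise_scale=0.0):
    """Top-k expert selection with optional Gaussian noise on the router
    logits -- an illustrative noise defense, not the paper's exact one."""
    if noise_scale:
        logits = logits + rng.normal(scale=noise_scale, size=logits.shape)
    return frozenset(np.argsort(-logits)[:top_k])

trials = 500
all_logits = rng.normal(size=(trials, num_experts))
clean = [top_k_experts(l) for l in all_logits]
noisy = [top_k_experts(l, noise_scale=1.0) for l in all_logits]
unchanged = sum(c == n for c, n in zip(clean, noisy))

# Noise flips some selections (corrupting the attacker's trace) but leaves
# many intact -- reconstruction is degraded, not eliminated.
print(f"{unchanged}/{trials} selections unchanged under noise")
```

The fraction of unchanged selections is the intuition behind "reduces but does not eliminate": whenever the top experts win by a wide logit margin, noise rarely flips them, so those tokens still leak.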

🛡️ Threat Analysis

Model Inversion Attack

The attack explicitly frames itself as embedding inversion — recovering text from discrete intermediate model representations (expert selection traces). The mechanism mirrors ML03's 'embedding inversion' sub-category: an adversary observes internal routing signals and reconstructs the underlying token sequence, analogous to inverting embedding vectors to recover text.
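To make the distributed-inference leakage scenario concrete: in expert-parallel deployments each expert lives on a known device, so an observer of token-to-device traffic constrains, and sometimes pins down, which experts were selected. The round-robin placement below is a hypothetical example, not a description of any specific system:

```python
# Hypothetical expert-parallel placement: 8 experts sharded round-robin
# across 4 devices (illustrative; real deployments vary).
num_experts, num_devices = 8, 4
expert_to_device = {e: e % num_devices for e in range(num_experts)}

# A token routed to experts {1, 6} sends traffic to devices {1, 2}.
selected = {1, 6}
observed_devices = {expert_to_device[e] for e in selected}

# An eavesdropper on device traffic narrows the selection to the experts
# hosted on those devices -- here 4 candidates out of 8, i.e. one bit of
# the routing trace leaks per observed device.
candidates = {e for e in range(num_experts)
              if expert_to_device[e] in observed_devices}
print(observed_devices, candidates)  # -> {1, 2} {1, 2, 5, 6}
```

With fewer experts per device (or one expert per device, as in some expert-parallel setups), the observed traffic identifies the selection exactly, yielding the full trace the reconstruction attack needs.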


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
grey_box, inference_time
Datasets
OpenWebText
Applications
llm inference, distributed inference, MoE language model deployments