Defense · 2025

Leave It to the Experts: Detecting Knowledge Distillation via MoE Expert Signatures

Pingzhi Li 1, Morris Yu-Chao Huang 1, Zhen Tan 2, Qingquan Song 3, Jie Peng 1, Kai Zou 4, Yu Cheng 5, Kaidi Xu 6, Tianlong Chen 1

0 citations · 46 references · arXiv


Published on arXiv · 2510.16968

Model Theft

OWASP ML Top 10: ML05

Key Finding

Achieves >94% KD detection accuracy across diverse scenarios with strong robustness to prompt-based evasion, outperforming existing self-identity and output-similarity baselines

Shadow-MoE

Novel technique introduced


Knowledge Distillation (KD) accelerates training of large language models (LLMs) but poses risks to intellectual property protection and LLM diversity. Existing KD detection methods based on self-identity or output similarity can be easily evaded through prompt engineering. We present a KD detection framework effective in both white-box and black-box settings by exploiting an overlooked signal: the transfer of MoE "structural habits", especially internal routing patterns. Our approach analyzes how different experts specialize and collaborate across various inputs, creating distinctive fingerprints that persist through the distillation process. To extend beyond the white-box setup and MoE architectures, we further propose Shadow-MoE, a black-box method that constructs proxy MoE representations via auxiliary distillation to compare these patterns between arbitrary model pairs. We establish a comprehensive, reproducible benchmark that offers diverse distilled checkpoints and an extensible framework to facilitate future research. Extensive experiments demonstrate >94% detection accuracy across various scenarios and strong robustness to prompt-based evasion, outperforming existing baselines while highlighting structural-habit transfer in LLMs.


Key Contributions

  • Identifies MoE expert routing patterns ('structural habits') as persistent, transferable fingerprints that survive the knowledge distillation process and can serve as model provenance signals
  • Proposes Shadow-MoE, a black-box KD detection method that constructs auxiliary proxy MoE representations via distillation to compare routing signatures between arbitrary model pairs regardless of architecture
  • Establishes a comprehensive reproducible benchmark with diverse distilled LLM checkpoints and an extensible evaluation framework for future KD-detection research

🛡️ Threat Analysis

Model Theft

Knowledge distillation is framed as a model theft vector (IP extraction). The paper proposes fingerprinting models via their MoE structural routing habits to detect when a model has been cloned through KD, which serves as a direct defense against model intellectual property theft. The Shadow-MoE technique is analogous to model fingerprinting/watermark verification, but it relies on emergent routing signatures rather than injected triggers.
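To make the fingerprint-comparison idea concrete, here is a minimal sketch of one way routing signatures could be compared. All names (`routing_histogram`, `js_divergence`, `likely_distilled`) and the Jensen-Shannon metric and threshold are illustrative assumptions, not the paper's actual method, which analyzes richer expert specialization and collaboration patterns.

```python
import numpy as np

def routing_histogram(expert_ids, num_experts):
    """Hypothetical signature: how often each expert is selected
    across a probe set of tokens (a 1-D routing distribution)."""
    counts = np.bincount(expert_ids, minlength=num_experts).astype(float)
    return counts / counts.sum()

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two routing distributions
    (symmetric, bounded by log 2)."""
    p, q = p + eps, q + eps
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: float(np.sum(a * np.log(a / b)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def likely_distilled(sig_teacher, sig_candidate, threshold=0.05):
    """Flag a candidate as a likely distillation of the teacher when its
    routing signature is unusually close (threshold is an assumption)."""
    return js_divergence(sig_teacher, sig_candidate) < threshold
```

In this toy setup, signatures from a teacher and a suspected student would be gathered on the same probe inputs (for non-MoE students, via a Shadow-MoE proxy) and compared; a near-zero divergence suggests the "structural habits" survived distillation.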


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
white_box, black_box, inference_time
Applications
llm intellectual property protection, knowledge distillation detection, llm provenance verification