Defense · 2025

Leave It to the Experts: Detecting Knowledge Distillation via MoE Expert Signatures

Pingzhi Li 1, Morris Yu-Chao Huang 1, Zhen Tan 2, Qingquan Song 3, Jie Peng 1, Kai Zou 4, Yu Cheng 5, Kaidi Xu 6, Tianlong Chen 1

0 citations · 46 references · arXiv


Published on arXiv · 2510.16968

Model Theft

OWASP ML Top 10: ML05

Key Finding

Achieves >94% KD detection accuracy across diverse scenarios with strong robustness to prompt-based evasion, outperforming existing self-identity and output-similarity baselines

Shadow-MoE

Novel technique introduced


Knowledge Distillation (KD) accelerates training of large language models (LLMs) but poses risks to intellectual property protection and LLM diversity. Existing KD detection methods based on self-identity or output similarity can be easily evaded through prompt engineering. We present a KD detection framework effective in both white-box and black-box settings by exploiting an overlooked signal: the transfer of MoE "structural habits", especially internal routing patterns. Our approach analyzes how different experts specialize and collaborate across various inputs, creating distinctive fingerprints that persist through the distillation process. To extend beyond the white-box setup and MoE architectures, we further propose Shadow-MoE, a black-box method that constructs proxy MoE representations via auxiliary distillation to compare these patterns between arbitrary model pairs. We establish a comprehensive, reproducible benchmark that offers diverse distilled checkpoints and an extensible framework to facilitate future research. Extensive experiments demonstrate >94% detection accuracy across various scenarios and strong robustness to prompt-based evasion, outperforming existing baselines while highlighting structural-habit transfer in LLMs.


Key Contributions

  • Identifies MoE expert routing patterns ('structural habits') as persistent, transferable fingerprints that survive the knowledge distillation process and can serve as model provenance signals
  • Proposes Shadow-MoE, a black-box KD detection method that constructs auxiliary proxy MoE representations via distillation to compare routing signatures between arbitrary model pairs regardless of architecture
  • Establishes a comprehensive reproducible benchmark with diverse distilled LLM checkpoints and an extensible evaluation framework for future KD-detection research

🛡️ Threat Analysis

Model Theft

Knowledge distillation is framed as a model theft vector (IP extraction). The paper proposes fingerprinting models via their MoE structural routing habits to detect when a model has been cloned through KD, which serves as a direct defense against model intellectual property theft. The Shadow-MoE technique is analogous to model fingerprinting/watermark verification, but it relies on emergent routing signatures rather than injected triggers.
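To make the fingerprint-comparison idea concrete, here is a minimal sketch of one way routing signatures could be compared. All names (`routing_histogram`, `js_divergence`, `likely_distilled`) and the Jensen-Shannon metric and threshold are illustrative assumptions, not the paper's actual method, which analyzes richer expert specialization and collaboration patterns.

```python
import numpy as np

def routing_histogram(expert_ids, num_experts):
    """Hypothetical signature: how often each expert is selected
    across a probe set of tokens (a 1-D routing distribution)."""
    counts = np.bincount(expert_ids, minlength=num_experts).astype(float)
    return counts / counts.sum()

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two routing distributions
    (symmetric, bounded by log 2)."""
    p, q = p + eps, q + eps
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: float(np.sum(a * np.log(a / b)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def likely_distilled(sig_teacher, sig_candidate, threshold=0.05):
    """Flag a candidate as a likely distillation of the teacher when its
    routing signature is unusually close (threshold is an assumption)."""
    return js_divergence(sig_teacher, sig_candidate) < threshold
```

In this toy setup, signatures from a teacher and a suspected student would be gathered on the same probe inputs (for non-MoE students, via a Shadow-MoE proxy) and compared; a near-zero divergence suggests the "structural habits" survived distillation.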


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
white_box, black_box, inference_time
Applications
llm intellectual property protection, knowledge distillation detection, llm provenance verification