defense · arXiv · Oct 19, 2025
Pingzhi Li, Morris Yu-Chao Huang, Zhen Tan et al. · UNC-Chapel Hill · Arizona State University +4 more
Detects LLM knowledge distillation (model theft) by fingerprinting MoE expert routing patterns in both white-box and black-box settings
Model Theft nlp
Knowledge Distillation (KD) accelerates training of large language models (LLMs) but poses risks to intellectual property protection and LLM diversity. Existing KD detection methods based on self-identity or output similarity can be easily evaded through prompt engineering. We present a KD detection framework effective in both white-box and black-box settings by exploiting an overlooked signal: the transfer of MoE "structural habits", especially internal routing patterns. Our approach analyzes how different experts specialize and collaborate across various inputs, creating distinctive fingerprints that persist through the distillation process. To extend beyond the white-box setup and MoE architectures, we further propose Shadow-MoE, a black-box method that constructs proxy MoE representations via auxiliary distillation to compare these patterns between arbitrary model pairs. We also establish a comprehensive, reproducible benchmark that offers diverse distilled checkpoints and an extensible framework to facilitate future research. Extensive experiments demonstrate >94% detection accuracy across various scenarios and strong robustness to prompt-based evasion, outperforming existing baselines while highlighting the transfer of structural habits in LLMs.
llm transformer
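The routing-fingerprint idea above can be illustrated with a toy sketch: summarize each model by the distribution of experts its router selects over a shared probe set, then compare fingerprints with a symmetric divergence. This is not the paper's implementation; the expert counts, probe distributions, and use of Jensen-Shannon divergence here are illustrative assumptions.

```python
# Illustrative sketch (not the paper's method): fingerprint MoE routing as
# a histogram of expert selections over probe inputs, then compare two
# fingerprints with Jensen-Shannon divergence. All numbers are toy values.
import numpy as np

def routing_fingerprint(expert_choices, n_experts):
    """Normalized histogram of which experts were selected across probes."""
    counts = np.bincount(expert_choices, minlength=n_experts).astype(float)
    return counts / counts.sum()

def jensen_shannon(p, q, eps=1e-12):
    """Symmetric divergence between two routing distributions."""
    p, q = p + eps, q + eps
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log(a / b))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

rng = np.random.default_rng(0)
# A "teacher" and a distilled "student" share similar routing habits;
# an unrelated model routes roughly uniformly.
teacher   = rng.choice(8, size=1000, p=[.3, .2, .15, .1, .1, .05, .05, .05])
student   = rng.choice(8, size=1000, p=[.28, .22, .14, .11, .1, .05, .05, .05])
unrelated = rng.choice(8, size=1000)

fp_teacher   = routing_fingerprint(teacher, 8)
fp_student   = routing_fingerprint(student, 8)
fp_unrelated = routing_fingerprint(unrelated, 8)

d_student   = jensen_shannon(fp_teacher, fp_student)
d_unrelated = jensen_shannon(fp_teacher, fp_unrelated)
```

A distilled student's fingerprint sits much closer to the teacher's than an unrelated model's does, which is the signal a detector could threshold on.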
defense · arXiv · Jan 31, 2026
Ziyao Wang, Nizhang Li, Pingzhi Li et al. · College Park · Macau University of Science and Technology +1 more
Defends open-source LLMs against unauthorized fine-tuning by hiding a sparse subnetwork mask, degrading adaptation without the key
Transfer Learning Attack Model Theft nlp
Open-sourcing foundation models (FMs) enables broad reuse but also exposes model trainers to economic and safety risks from unrestricted downstream fine-tuning. We address this problem by building non-fine-tunable foundation models: models that remain broadly usable in their released form while yielding limited adaptation gains under task-agnostic unauthorized fine-tuning. We propose Private Mask Pre-Training (PMP), a pre-training framework that concentrates representation learning into a sparse subnetwork identified early in training. The binary mask defining this subnetwork is kept private, and only the final dense weights are released. This forces unauthorized fine-tuning, which lacks access to the mask, to update parameters misaligned with the pre-training subspace, inducing an intrinsic mismatch between the fine-tuning objective and the pre-training geometry. We provide theoretical analysis showing that this mismatch destabilizes gradient-based adaptation and bounds fine-tuning gains. Empirical results on large language models demonstrate that PMP preserves base-model performance while consistently degrading unauthorized fine-tuning across a wide range of downstream tasks, with the strength of non-fine-tunability controlled by the mask ratio.
llm transformer College Park · Macau University of Science and Technology · The University of North Carolina at Chapel Hill
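The core mechanism described above, confining training updates to a private sparse subnetwork while releasing dense weights, can be sketched in a few lines. This is a minimal illustration under assumed details (random mask selection, a single masked gradient step); the actual PMP procedure for identifying the subnetwork early in training is not reproduced here.

```python
# Illustrative sketch of the masking idea behind PMP (not the paper's code):
# gradient updates during pre-training are confined to a sparse private
# subnetwork; the released weights are dense and carry no explicit record
# of which coordinates were trained.
import numpy as np

rng = np.random.default_rng(42)
dim = 100
mask_ratio = 0.1  # fraction of weights in the private subnetwork

# Dense weights (eventually released) and a private binary mask (kept secret).
w = rng.normal(scale=0.01, size=dim)
mask = np.zeros(dim, dtype=bool)
mask[rng.choice(dim, size=int(mask_ratio * dim), replace=False)] = True

def masked_update(w, grad, mask, lr=0.1):
    """Apply a gradient step only inside the private subnetwork."""
    return w - lr * grad * mask

grad = rng.normal(size=dim)  # stand-in for a pre-training gradient
w_new = masked_update(w, grad, mask)
```

Because the trainer's updates never leave the masked subspace, a downstream fine-tuner without the mask updates mostly coordinates that carry no learned representation, which is the mismatch the abstract's theory bounds.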