Towards Distillation-Resistant Large Language Models: An Information-Theoretic Perspective
Hao Fang 1, Tianyi Zhang 1, Tianqu Zhuang 1, Jiawei Kong 1, Kuofeng Gao 1, Bin Chen 2, Leqi Liang 1, Shu-Tao Xia 1, Ke Xu 1
Published on arXiv
2602.03396
Model Theft
OWASP ML Top 10 — ML05
Model Theft
OWASP LLM Top 10 — LLM10
Key Finding
The proposed CMI minimization defense significantly degrades distillation attack performance while preserving the teacher LLM's original task accuracy across multiple models and strong distillation algorithms.
CMI-based Anti-Distillation
Novel technique introduced
Proprietary large language models (LLMs) embody substantial economic value and are typically exposed only as black-box APIs, yet adversaries can still exploit their outputs to extract knowledge via distillation. Existing defenses focus exclusively on text-based distillation, leaving the important logit-based setting largely unexplored. In this work, we analyze this problem and present an effective solution from an information-theoretic perspective. We characterize distillation-relevant information in teacher outputs using the conditional mutual information (CMI) between teacher logits and input queries conditioned on ground-truth labels. This quantity captures the contextual information that benefits model extraction, motivating a defense against distillation based on CMI minimization. Guided by our theoretical analysis, we propose learning a transformation matrix that purifies the original outputs to enhance distillation resistance. We further derive a CMI-inspired anti-distillation objective to optimize this transformation, which effectively removes distillation-relevant information while preserving output utility. Extensive experiments across multiple LLMs and strong distillation algorithms demonstrate that the proposed method significantly degrades distillation performance while preserving task accuracy, effectively protecting models' intellectual property.
Key Contributions
- Formalizes distillation-relevant information in LLM outputs using conditional mutual information (CMI) between teacher logits and input queries conditioned on ground-truth labels
- Proposes a learnable transformation matrix that purifies model outputs to remove distillation-relevant information while preserving task utility
- Derives a CMI-inspired anti-distillation training objective shown to significantly degrade student model performance across multiple LLMs and distillation algorithms
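To make the defense concrete, the sketch below applies a learnable transformation matrix to teacher logits and evaluates a surrogate objective: a utility term (log-likelihood of the ground-truth label) plus a label-conditioned variance penalty standing in for the CMI term I(logits; query | label). The function names, the variance proxy, and the exact loss form are illustrative assumptions on our part, not the paper's actual formulation.

```python
import numpy as np

def softmax(z):
    """Numerically stable row-wise softmax."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def purify(logits, T):
    """Apply a (hypothetical) learnable transformation matrix T
    to the teacher's raw logits before they leave the API."""
    return logits @ T

def anti_distillation_loss(logits, labels, T, lam=1.0):
    """Surrogate anti-distillation objective (our assumption):
    - utility term: keep the ground-truth label likely under the
      purified outputs, so task accuracy is preserved;
    - CMI proxy: shrink the within-label variance of purified logits
      across queries, a stand-in for I(logits; query | label)."""
    z = purify(logits, T)
    p = softmax(z)
    n = len(labels)
    # utility: mean negative log-likelihood of the true labels
    nll = -np.log(p[np.arange(n), labels] + 1e-12).mean()
    # CMI proxy: label-conditioned variance of the purified logits
    cmi_proxy = 0.0
    for y in np.unique(labels):
        group = z[labels == y]
        cmi_proxy += group.var(axis=0).sum() * len(group) / n
    return nll + lam * cmi_proxy
```

In the paper's setting, T would be optimized (e.g. by gradient descent) to minimize such an objective, and the API would return the purified logits; the optimization loop is omitted here for brevity.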
🛡️ Threat Analysis
The paper explicitly targets model extraction via knowledge distillation: adversaries query a black-box LLM API and use its logit outputs to train a cloned student model. The proposed defense (CMI minimization plus an output transformation) directly protects the model's intellectual property from being extracted through distillation.