Defense · 2026

Towards Distillation-Resistant Large Language Models: An Information-Theoretic Perspective

Hao Fang 1, Tianyi Zhang 1, Tianqu Zhuang 1, Jiawei Kong 1, Kuofeng Gao 1, Bin Chen 2, Leqi Liang 1, Shu-Tao Xia 1, Ke Xu 1

0 citations · 25 references · arXiv (Cornell University)


Published on arXiv · arXiv:2602.03396

Model Theft (OWASP ML Top 10 — ML05)

Model Theft (OWASP LLM Top 10 — LLM10)

Key Finding

The proposed CMI minimization defense significantly degrades distillation attack performance while preserving the teacher LLM's original task accuracy across multiple models and strong distillation algorithms.

CMI-based Anti-Distillation

Novel technique introduced


Proprietary large language models (LLMs) embody substantial economic value and are generally exposed only as black-box APIs, yet adversaries can still exploit their outputs to extract knowledge via distillation. Existing defenses focus exclusively on text-based distillation, leaving the important logit-based distillation largely unexplored. In this work, we analyze this problem and present an effective solution from an information-theoretic perspective. We characterize distillation-relevant information in teacher outputs using the conditional mutual information (CMI) between teacher logits and input queries conditioned on ground-truth labels. This quantity captures contextual information beneficial for model extraction, motivating us to defend against distillation via CMI minimization. Guided by our theoretical analysis, we propose learning a transformation matrix that purifies the original outputs to enhance distillation resistance. We further derive a CMI-inspired anti-distillation objective to optimize this transformation, which effectively removes distillation-relevant information while preserving output utility. Extensive experiments across multiple LLMs and strong distillation algorithms demonstrate that the proposed method significantly degrades distillation performance while preserving task accuracy, effectively protecting models' intellectual property.


Key Contributions

  • Formalizes distillation-relevant information in LLM outputs using conditional mutual information (CMI) between teacher logits and input queries conditioned on ground-truth labels
  • Proposes a learnable transformation matrix that purifies model outputs to remove distillation-relevant information while preserving task utility
  • Derives a CMI-inspired anti-distillation training objective shown to significantly degrade student model performance across multiple LLMs and distillation algorithms
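The paper's transformation matrix and CMI objective are learned; as a rough, self-contained illustration of the underlying idea only (not the authors' method), the sketch below applies a fixed smoothing matrix `W` to teacher logits so that the argmax (task utility) is preserved while query-specific information beyond the label shrinks. The `cmi_proxy` function, `alpha`, and the choice of `W` are all illustrative assumptions introduced here: the proxy measures the average KL divergence between each query's output distribution and the mean distribution of queries sharing its label, a crude stand-in for I(logits; query | label).

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def cmi_proxy(probs, labels):
    """Illustrative proxy for I(logits; query | label): mean KL between each
    query's output distribution and its label group's mean distribution."""
    total = 0.0
    for y in np.unique(labels):
        p = probs[labels == y]                 # distributions for this label
        p_bar = p.mean(axis=0, keepdims=True)  # label-conditional prototype
        total += np.sum(p * (np.log(p + 1e-12) - np.log(p_bar + 1e-12)))
    return total / len(labels)

rng = np.random.default_rng(0)
V = 8                                    # toy vocabulary size
logits = rng.normal(size=(32, V)) * 3.0  # toy teacher logits for 32 queries
labels = logits.argmax(axis=1)           # treat teacher's top-1 as the label

# "Purification": shrink logits toward a uniform mixture while keeping the
# argmax — a crude, fixed stand-in for the paper's *learned* transformation.
alpha = 0.2
W = alpha * np.eye(V) + (1 - alpha) / V * np.ones((V, V))
purified = logits @ W

assert (purified.argmax(axis=1) == labels).all()  # top-1 utility preserved
print(cmi_proxy(softmax(logits), labels))
print(cmi_proxy(softmax(purified), labels))       # query-specific info reduced
```

Because `W` adds a per-row constant and scales every logit by the same positive factor, the ranking of tokens is unchanged, which is why utility survives even as the distributions flatten.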

🛡️ Threat Analysis

Model Theft

The paper explicitly defends against model extraction via knowledge distillation — adversaries query a black-box LLM API and use its logit outputs to train a cloned student model. The proposed defense (CMI minimization + output transformation) directly protects model intellectual property from being extracted through distillation.
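For context on what the defense disrupts, the attack side is standard logit-based knowledge distillation: the student is trained to match the teacher's temperature-softened output distribution. The sketch below is the classic Hinton-style KD loss, shown here as a generic illustration of the attack surface; the temperature `T` and the `T²` scaling are conventional choices, not values from this paper.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def kd_loss(student_logits, teacher_logits, T=2.0):
    """Logit-based distillation loss: KL(teacher_T || student_T) * T^2,
    averaged over queries (the standard soft-label objective)."""
    p = softmax(teacher_logits / T)
    log_p = np.log(p + 1e-12)
    log_q = np.log(softmax(student_logits / T) + 1e-12)
    return T * T * np.sum(p * (log_p - log_q), axis=-1).mean()

teacher = np.array([[4.0, 1.0, 0.0]])
assert kd_loss(teacher, teacher) < 1e-9       # perfect clone -> zero loss
print(kd_loss(np.zeros((1, 3)), teacher))     # uniform student is penalized
```

A defense that purifies the teacher's logits degrades exactly this signal: the student can still recover the top-1 answer but gains little from the inter-class structure the soft labels would otherwise leak.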


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
black_box, inference_time
Applications
proprietary llm apis, knowledge distillation protection