Towards Distillation-Resistant Large Language Models: An Information-Theoretic Perspective
Hao Fang 1, Tianyi Zhang 1, Tianqu Zhuang 1, Jiawei Kong 1, Kuofeng Gao 1, Bin Chen 2, Leqi Liang 1, Shu-Tao Xia 1, Ke Xu 1
Published on arXiv
2602.03396
Model Theft
OWASP ML Top 10 — ML05
Model Theft
OWASP LLM Top 10 — LLM10
Key Finding
The proposed CMI minimization defense significantly degrades distillation attack performance while preserving the teacher LLM's original task accuracy across multiple models and strong distillation algorithms.
CMI-based Anti-Distillation
Novel technique introduced
Proprietary large language models (LLMs) embody substantial economic value and are typically exposed only as black-box APIs, yet adversaries can still exploit their outputs to extract knowledge via distillation. Existing defenses focus exclusively on text-based distillation, leaving the important logit-based setting largely unexplored. In this work, we analyze this problem and present an effective solution from an information-theoretic perspective. We characterize distillation-relevant information in teacher outputs using the conditional mutual information (CMI) between teacher logits and input queries conditioned on ground-truth labels. This quantity captures the contextual information that benefits model extraction, motivating a defense against distillation based on CMI minimization. Guided by our theoretical analysis, we propose learning a transformation matrix that purifies the original outputs to enhance distillation resistance. We further derive a CMI-inspired anti-distillation objective to optimize this transformation, which effectively removes distillation-relevant information while preserving output utility. Extensive experiments across multiple LLMs and strong distillation algorithms demonstrate that the proposed method significantly degrades distillation performance while preserving task accuracy, effectively protecting models' intellectual property.
Key Contributions
- Formalizes distillation-relevant information in LLM outputs using conditional mutual information (CMI) between teacher logits and input queries conditioned on ground-truth labels
- Proposes a learnable transformation matrix that purifies model outputs to remove distillation-relevant information while preserving task utility
- Derives a CMI-inspired anti-distillation training objective shown to significantly degrade student model performance across multiple LLMs and distillation algorithms
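To make the defense concrete, the sketch below applies a learnable transformation matrix to teacher logits and evaluates a surrogate objective: a utility term (log-likelihood of the ground-truth label) plus a label-conditioned variance penalty standing in for the CMI term I(logits; query | label). The function names, the variance proxy, and the exact loss form are illustrative assumptions on our part, not the paper's actual formulation.

```python
import numpy as np

def softmax(z):
    """Numerically stable row-wise softmax."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def purify(logits, T):
    """Apply a (hypothetical) learnable transformation matrix T
    to the teacher's raw logits before they leave the API."""
    return logits @ T

def anti_distillation_loss(logits, labels, T, lam=1.0):
    """Surrogate anti-distillation objective (our assumption):
    - utility term: keep the ground-truth label likely under the
      purified outputs, so task accuracy is preserved;
    - CMI proxy: shrink the within-label variance of purified logits
      across queries, a stand-in for I(logits; query | label)."""
    z = purify(logits, T)
    p = softmax(z)
    n = len(labels)
    # utility: mean negative log-likelihood of the true labels
    nll = -np.log(p[np.arange(n), labels] + 1e-12).mean()
    # CMI proxy: label-conditioned variance of the purified logits
    cmi_proxy = 0.0
    for y in np.unique(labels):
        group = z[labels == y]
        cmi_proxy += group.var(axis=0).sum() * len(group) / n
    return nll + lam * cmi_proxy
```

In the paper's setting, T would be optimized (e.g. by gradient descent) to minimize such an objective, and the API would return the purified logits; the optimization loop is omitted here for brevity.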
🛡️ Threat Analysis
The paper explicitly targets model extraction via knowledge distillation: adversaries query a black-box LLM API and use its logit outputs to train a cloned student model. The proposed defense (CMI minimization plus an output transformation) directly protects the model's intellectual property from being extracted through distillation.