DistillGuard: Evaluating Defenses Against LLM Knowledge Distillation
Published on arXiv: 2603.07835
Model Theft (OWASP ML Top 10: ML05)
Model Theft (OWASP LLM Top 10: LLM10)
Key Finding
Most output-level defenses fail to prevent knowledge distillation; only chain-of-thought removal substantially degrades student performance on math (31.4% vs. 67.8% baseline), while code generation remains unaffected.
DistillGuard
Novel technique introduced
Knowledge distillation from proprietary LLM APIs poses a growing threat to model providers, yet defenses against this attack remain fragmented and unevaluated. We present DistillGuard, a framework for systematically evaluating output-level defenses against LLM knowledge distillation. We introduce a taxonomy of three defense categories (output perturbation, data poisoning, and information throttling) and evaluate nine defense configurations using a standardized pipeline with Qwen3-14B as teacher and Qwen2.5-7B-Instruct as student across three benchmarks (MATH-500, HumanEval+, MT-Bench). Our results reveal that, in a same-family distillation setting against a naive attacker, most output-level defenses are surprisingly ineffective: paraphrasing-based perturbation barely degrades distilled student quality, and data poisoning primarily impairs conversational fluency while leaving task-specific capabilities intact. Only chain-of-thought removal substantially impairs mathematical reasoning (31.4% vs. 67.8% baseline), though code generation remains unaffected. These findings demonstrate that the effectiveness of distillation defenses is highly task-dependent and that current output-level approaches are insufficient to broadly prevent knowledge theft.
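The chain-of-thought-removal defense highlighted above can be sketched as a small output filter. This is a minimal illustration, not the paper's implementation: it assumes the teacher emits its reasoning inside `<think>...</think>` tags (the convention used by Qwen3 in thinking mode) and serves only the final answer to the API caller.

```python
import re

def strip_chain_of_thought(response: str) -> str:
    """Remove reasoning traces from a teacher response before serving it.

    Assumes reasoning is wrapped in <think>...</think> tags (a Qwen3-style
    convention; other APIs may mark reasoning differently). Only the text
    outside those tags, i.e. the final answer, is returned.
    """
    cleaned = re.sub(r"<think>.*?</think>", "", response, flags=re.DOTALL)
    return cleaned.strip()
```

Because the student never sees the intermediate reasoning, it must learn to map problems directly to answers, which is what degrades distilled math performance while leaving code generation (where the answer itself carries most of the signal) largely intact.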
Key Contributions
- Taxonomy of output-level distillation defenses into three categories: output perturbation, data poisoning, and information throttling
- Standardized evaluation framework (DistillGuard) with metrics for distillation effectiveness and distillation cost across nine defense configurations
- Empirical finding that most output-level defenses are insufficient: only chain-of-thought removal substantially impairs distillation quality, and only for reasoning-heavy tasks
🛡️ Threat Analysis
Knowledge distillation via API querying is a model theft attack — adversaries reconstruct a proprietary model's functionality by training a student on its outputs. The paper evaluates defenses (output perturbation, poisoning, throttling) specifically aimed at preventing this intellectual property theft.
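The attack pipeline described above can be sketched in a few lines. This is a hypothetical illustration of the threat model, not the paper's code: `query_teacher` stands in for an API client, and `defense` is an optional output-level transform (e.g. paraphrasing or chain-of-thought removal) the provider might apply before responses leave the API.

```python
def collect_distillation_data(prompts, query_teacher, defense=None):
    """Build a student fine-tuning set by querying a teacher model's API.

    prompts:       iterable of input strings the adversary sends.
    query_teacher: callable mapping a prompt to the teacher's response
                   (a stand-in for a real API client).
    defense:       optional provider-side transform applied to each
                   response before the adversary ever sees it.
    """
    dataset = []
    for prompt in prompts:
        output = query_teacher(prompt)
        if defense is not None:
            output = defense(output)  # defended output is all the attacker gets
        dataset.append({"prompt": prompt, "completion": output})
    return dataset
```

Training a student on the resulting prompt/completion pairs is ordinary supervised fine-tuning, which is why purely output-level defenses must degrade the pairs themselves to have any effect.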