
DistillGuard: Evaluating Defenses Against LLM Knowledge Distillation

Bo Jiang

Published on arXiv: 2603.07835

Model Theft (OWASP ML Top 10 — ML05)

Model Theft (OWASP LLM Top 10 — LLM10)

Key Finding

Most output-level defenses fail to prevent knowledge distillation; only chain-of-thought removal substantially degrades student performance on math (31.4% vs. 67.8% baseline), while code generation remains unaffected.

DistillGuard

Novel technique introduced


Knowledge distillation from proprietary LLM APIs poses a growing threat to model providers, yet defenses against this attack remain fragmented and unevaluated. We present DistillGuard, a framework for systematically evaluating output-level defenses against LLM knowledge distillation. We introduce a taxonomy of three defense categories -- output perturbation, data poisoning, and information throttling -- and evaluate nine defense configurations using a standardized pipeline with Qwen3-14B as teacher and Qwen2.5-7B-Instruct as student across three benchmarks (MATH-500, HumanEval+, MT-Bench). Our results reveal that, in a same-family distillation setting against a naive attacker, most output-level defenses are surprisingly ineffective: paraphrasing-based perturbation barely degrades distilled student quality, and data poisoning primarily impairs conversational fluency while leaving task-specific capabilities intact. Only chain-of-thought removal substantially impairs mathematical reasoning (31.4% vs. 67.8% baseline), though code generation remains unaffected. These findings demonstrate that the effectiveness of distillation defenses is highly task-dependent and that current output-level approaches are insufficient to broadly prevent knowledge theft.


Key Contributions

  • Taxonomy of output-level distillation defenses into three categories: output perturbation, data poisoning, and information throttling
  • Standardized evaluation framework (DistillGuard) with metrics for distillation effectiveness and distillation cost across nine defense configurations
  • Empirical finding that most output-level defenses are insufficient: only chain-of-thought removal substantially impairs distillation quality, and only for reasoning-heavy tasks
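The three defense categories in the taxonomy can be illustrated with a minimal sketch. All function names, the `<think>` tag convention, and the corruption/truncation strategies below are illustrative assumptions, not DistillGuard's actual implementation:

```python
import re
import random

def remove_chain_of_thought(response: str) -> str:
    """Output perturbation: strip intermediate reasoning and return only the
    final answer. Assumes reasoning is wrapped in <think>...</think> tags,
    as in some reasoning models."""
    return re.sub(r"<think>.*?</think>", "", response, flags=re.DOTALL).strip()

def poison_response(response: str, rate: float = 0.1, rng=None) -> str:
    """Data poisoning: corrupt a random fraction of sentences (here, by
    shuffling their words) so a student trained on the outputs degrades."""
    rng = rng or random.Random(0)
    sentences = response.split(". ")
    for i, s in enumerate(sentences):
        if rng.random() < rate:
            words = s.split()
            rng.shuffle(words)
            sentences[i] = " ".join(words)
    return ". ".join(sentences)

def throttle(response: str, max_tokens: int = 64) -> str:
    """Information throttling: cap how much output each query exposes
    (token count approximated by whitespace splitting)."""
    return " ".join(response.split()[:max_tokens])
```

In this framing, chain-of-thought removal is the one perturbation the paper finds effective, and only for reasoning-heavy tasks: the student still sees correct answers but loses the intermediate steps that teach mathematical reasoning.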

🛡️ Threat Analysis

Model Theft

Knowledge distillation via API querying is a model theft attack — adversaries reconstruct a proprietary model's functionality by training a student on its outputs. The paper evaluates defenses (output perturbation, poisoning, throttling) specifically aimed at preventing this intellectual property theft.
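The attack being defended against can be sketched as a simple query-and-fine-tune loop. The `query_teacher` function below is a runnable stand-in for a proprietary API call, and the fine-tuning step is only described in comments; none of this is the paper's exact pipeline:

```python
def query_teacher(prompt: str) -> str:
    """Stand-in for a proprietary LLM API call (in practice an HTTP request
    to the provider's endpoint); returns a canned response here so the
    sketch is runnable."""
    return f"Teacher answer to: {prompt}"

def build_distillation_set(prompts):
    """Harvest supervised fine-tuning pairs from the teacher's outputs."""
    return [{"prompt": p, "response": query_teacher(p)} for p in prompts]

# The adversary then fine-tunes a student model on these pairs with a
# standard supervised fine-tuning loss, reproducing much of the teacher's
# task capability without ever accessing its weights -- which is why the
# attack is black-box and inference-time only.
```

Output-level defenses intervene inside `query_teacher`, transforming the response before it leaves the API; the paper's finding is that most such transformations leave the harvested dataset good enough to train a capable student.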


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
black_box, inference_time
Datasets
MATH-500, HumanEval+, MT-Bench
Applications
llm apis, proprietary language model protection