Defense · 2026

Protecting Language Models Against Unauthorized Distillation through Trace Rewriting

Xinhang Ma , William Yeoh , Ning Zhang , Yevgeniy Vorobeychik

0 citations · 43 references


Published on arXiv

2602.15143

Model Theft (OWASP ML Top 10 — ML05)

Model Theft (OWASP LLM Top 10 — LLM10)

Key Finding

Instruction-based trace rewriting achieves strong anti-distillation effect while maintaining or improving teacher performance, and enables watermark detection with essentially no false alarms in student models.

Trace Rewriting

Novel technique introduced


Knowledge distillation is a widely adopted technique for transferring capabilities from LLMs to smaller, more efficient student models. However, unauthorized use of knowledge distillation takes unfair advantage of the considerable effort and cost put into developing frontier models. We investigate methods for modifying teacher-generated reasoning traces to achieve two objectives that deter unauthorized distillation: (1) anti-distillation, or degrading the training usefulness of query responses, and (2) API watermarking, which embeds verifiable signatures in student models. We introduce several approaches for dynamically rewriting a teacher's reasoning outputs while preserving answer correctness and semantic coherence. Two of these leverage the rewriting capabilities of LLMs, while others use gradient-based techniques. Our experiments show that a simple instruction-based rewriting approach achieves a strong anti-distillation effect while maintaining or even improving teacher performance. Furthermore, we show that our rewriting approach also enables highly reliable watermark detection with essentially no false alarms.
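The instruction-based variant described in the abstract can be pictured as a thin wrapper around the serving path: before a reasoning trace leaves the API, it is passed back through an LLM with an instruction to paraphrase the reasoning while reproducing the final answer verbatim. The sketch below is hypothetical (the paper's exact prompt and pipeline are not given here); `rewrite_fn`, `build_rewrite_prompt`, and the instruction text are illustrative stand-ins.

```python
# Hypothetical sketch of instruction-based trace rewriting.
# `rewrite_fn` stands in for an LLM call; the instruction wording
# below is an assumption, not the paper's actual prompt.

REWRITE_INSTRUCTION = (
    "Rewrite the reasoning below in a different style and structure. "
    "Do not change any factual content, and reproduce the final answer "
    "exactly as given."
)

def build_rewrite_prompt(trace: str, answer: str) -> str:
    """Compose the rewriting prompt sent to the rewriting LLM."""
    return (
        f"{REWRITE_INSTRUCTION}\n\n"
        f"Reasoning:\n{trace}\n\n"
        f"Final answer:\n{answer}"
    )

def serve_response(trace: str, answer: str, rewrite_fn) -> str:
    """Return the API response with the trace rewritten and the
    original answer preserved (answer correctness is untouched)."""
    rewritten = rewrite_fn(build_rewrite_prompt(trace, answer))
    return f"{rewritten}\n\nAnswer: {answer}"
```

In this framing, anti-distillation strength depends entirely on how the rewriting LLM perturbs the trace; the wrapper itself only guarantees that the user-visible answer is unchanged.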


Key Contributions

  • Anti-distillation via dynamic trace rewriting that degrades training usefulness of teacher outputs while preserving answer correctness
  • API watermarking scheme that embeds verifiable signatures into student models through modified teacher reasoning traces, enabling high-confidence ownership detection with near-zero false alarms
  • Empirical evaluation showing that a simple instruction-based rewriting approach outperforms gradient-based approaches in both anti-distillation strength and teacher performance preservation
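The "near-zero false alarms" claim for watermark detection is naturally read as a hypothesis test: count how often a watermark signature appears in a suspect student's outputs and compare against the rate expected from a model that was never distilled from the watermarked teacher. The paper's exact detection statistic is not reproduced here; the following is a generic binomial-test sketch under that assumed framing, with `base_rate` (the null-hypothesis marker frequency) as an illustrative parameter.

```python
# Generic sketch of watermark detection as a one-sided binomial test.
# Null hypothesis: the student was NOT trained on watermarked traces,
# so markers appear only at the natural base rate.
from math import comb

def binom_sf(k: int, n: int, p: float) -> float:
    """P[X >= k] for X ~ Binomial(n, p) (survival function)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

def watermark_pvalue(marker_hits: int, total_outputs: int, base_rate: float) -> float:
    """p-value for observing at least `marker_hits` watermark markers
    in `total_outputs` student generations under the null hypothesis.
    A tiny p-value is evidence of distillation from the watermarked
    teacher; a high detection threshold keeps false alarms near zero."""
    return binom_sf(marker_hits, total_outputs, base_rate)
```

For example, 40 marker hits in 100 outputs against a 5% base rate yields an astronomically small p-value, while 5 hits in 100 is entirely consistent with chance; thresholding the p-value very low is what drives the false-alarm rate toward zero.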

🛡️ Threat Analysis

Model Theft

The paper's primary contribution is defending against model theft via unauthorized knowledge distillation, in which an adversary trains a student model on responses obtained by querying the teacher's API, thereby extracting the teacher's capabilities. Both proposed defenses (anti-distillation and API watermarking embedded into student model behavior) directly target this IP theft vector.


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
black_box, inference_time, training_time
Applications
llm api protection, knowledge distillation defense, model ip protection