
Membership and Memorization in LLM Knowledge Distillation

Ziqi Zhang 1, Ali Shahin Shamsabadi 2, Hanxiao Lu 3, Yifeng Cai 1, Hamed Haddadi 2,4

0 citations


Published on arXiv: 2508.07054

Membership Inference Attack

OWASP ML Top 10 — ML04

Model Inversion Attack

OWASP ML Top 10 — ML03

Sensitive Information Disclosure

OWASP LLM Top 10 — LLM06

Key Finding

All six evaluated LLM KD techniques transfer membership and memorization privacy risks from teacher to student, but risk magnitude varies substantially across KD methods, objectives, and model blocks.


Recent advances in Knowledge Distillation (KD) aim to mitigate the high computational demands of Large Language Models (LLMs) by transferring knowledge from a large "teacher" to a smaller "student" model. However, students may inherit the teacher's privacy risks when the teacher is trained on private data. In this work, we systematically characterize and investigate the membership and memorization privacy risks inherent in six LLM KD techniques. Using instruction-tuning settings that span seven NLP tasks, together with three teacher model families (GPT-2, LLAMA-2, and OPT) and student models of various sizes, we demonstrate that all existing LLM KD approaches transfer membership and memorization privacy risks from the teacher to its students. However, the extent of these privacy risks varies across KD techniques. We systematically analyse how key LLM KD components (KD objective functions, student training data, and NLP tasks) impact such privacy risks. We also demonstrate a significant disagreement between the memorization and membership privacy risks of LLM KD techniques. Finally, we characterize per-block privacy risk and demonstrate that it varies across blocks by a large margin.
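To make the transfer mechanism concrete, here is a minimal sketch of the classic soft-label distillation objective (forward KL between temperature-softened teacher and student distributions), which is one family of KD objectives of the kind the paper evaluates. All names are illustrative; this is not the paper's code.

```python
import math

def softmax(logits, temperature=1.0):
    # Numerically stable softmax over temperature-scaled logits.
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    # Forward KL(teacher || student) on softened distributions: the student
    # is pushed toward the teacher's full output distribution -- the channel
    # through which the teacher's private-data signal can leak.
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# A student that exactly matches the teacher incurs (near-)zero loss.
t = [2.0, 0.5, -1.0]
print(distillation_loss(t, t))                     # ~0.0
print(distillation_loss(t, [0.1, 0.1, 0.1]) > 0)   # True
```

In practice the same computation runs per output token over the vocabulary; the point is that the student optimizes toward the teacher's distribution, not just hard labels.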


Key Contributions

  • Systematic characterization of membership inference and memorization privacy risks across six LLM knowledge distillation techniques spanning three model families (GPT-2, LLAMA-2, OPT) and seven NLP tasks
  • Analysis of how KD objective functions, student training data, and NLP task type affect the magnitude of inherited privacy risks
  • Discovery of significant disagreement between memorization and membership privacy risk signals in LLM KD, and identification of per-block privacy risk variation

🛡️ Threat Analysis

Model Inversion Attack

The paper explicitly studies memorization (training data extraction) risk, measuring how much of the teacher's private training data is memorized by and recoverable from student models — this aligns with model inversion / training data reconstruction.

Membership Inference Attack

The paper directly measures membership inference attack success rates against student models to determine whether membership in the teacher's private training data is exposed through KD — the core ML04 threat.
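The simplest membership inference baseline is a loss-threshold attack: predict "member" when the model's per-example loss is below a threshold, since models tend to fit training examples better than unseen ones. A minimal sketch with made-up loss values (assumed for illustration, not from the paper):

```python
def loss_threshold_mia(losses, threshold):
    # Predict "member" when the per-example loss falls below the threshold.
    return [loss < threshold for loss in losses]

def attack_accuracy(member_losses, nonmember_losses, threshold):
    # Balanced accuracy of the threshold attack over members and non-members.
    tp = sum(l < threshold for l in member_losses)
    tn = sum(l >= threshold for l in nonmember_losses)
    return 0.5 * (tp / len(member_losses) + tn / len(nonmember_losses))

# Toy losses: members (seen during teacher training) score lower on average.
members = [0.2, 0.3, 0.25, 0.4]
nonmembers = [0.9, 1.1, 0.8, 1.3]
print(attack_accuracy(members, nonmembers, threshold=0.6))  # 1.0
```

Accuracy near 0.5 means the attacker does no better than guessing; accuracy well above 0.5 on a student model indicates that the teacher's membership signal survived distillation.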


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
training_time, black_box
Datasets
instruction-tuning datasets across 7 NLP tasks
Applications
llm compression, knowledge distillation, instruction tuning