Turning Black Box into White Box: Dataset Distillation Leaks
Huajie Chen 1, Tianqing Zhu 1, Yuchen Zhong 1, Yang Zhang 2, Shang Wang 3, Feng He 3, Lefeng Zhang 1, Jialiang Shen 4, Minghao Wang 1, Wanlei Zhou 1
Published on arXiv (arXiv:2603.01053)
Model Inversion Attack
OWASP ML Top 10 — ML03
Membership Inference Attack
OWASP ML Top 10 — ML04
Key Finding
IRA accurately predicts both the distillation algorithm and model architecture from synthetic datasets, and successfully performs membership inference and recovers sensitive real training samples.
Information Revelation Attack (IRA)
Novel technique introduced
Dataset distillation compresses a large real dataset into a small synthetic one, enabling models trained on the synthetic data to achieve performance comparable to those trained on the real data. Although synthetic datasets are assumed to be privacy-preserving, we show that existing distillation methods can cause severe privacy leakage: because synthetic datasets implicitly encode the weight trajectories of the distilled model, they become over-informative and exploitable by adversaries. To expose this risk, we introduce the Information Revelation Attack (IRA) against state-of-the-art distillation techniques. Experiments show that IRA accurately predicts both the distillation algorithm and model architecture, and can successfully infer membership and recover sensitive samples from the real dataset.
Key Contributions
- Architecture inference stage that predicts the distillation algorithm and model architecture from loss trajectories, effectively converting a black-box setting into a white-box one for the adversary
- Membership inference attack leveraging the locally cloned white-box model's hidden-layer and final-layer outputs
- Enhanced dual-network diffusion framework with trajectory loss for reconstructing real training samples from synthetic distilled datasets
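The first stage above fingerprints the distillation algorithm and architecture from loss trajectories. The paper's actual classifier is not reproduced here; the following is a minimal sketch under the assumption that different distillation algorithms leave distinguishable loss-curve shapes, using fabricated exponential-decay curves and a nearest-centroid rule:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_trajectory(decay, steps=50):
    """Simulated training-loss curve: exponential decay plus small noise.
    (Fabricated stand-in for a probe model's loss trajectory on the
    synthetic dataset.)"""
    t = np.arange(steps)
    return np.exp(-decay * t) + 0.01 * rng.standard_normal(steps)

# Reference trajectories for two hypothetical distillation algorithms;
# the names and decay rates are illustrative, not from the paper.
reference = {
    "gradient_matching": np.stack([make_trajectory(0.10) for _ in range(20)]),
    "trajectory_matching": np.stack([make_trajectory(0.05) for _ in range(20)]),
}
centroids = {name: trajs.mean(axis=0) for name, trajs in reference.items()}

def infer_algorithm(observed):
    """Assign an observed loss trajectory to the nearest reference centroid."""
    return min(centroids, key=lambda n: np.linalg.norm(observed - centroids[n]))

probe = make_trajectory(0.05)  # adversary's probe run on the synthetic set
print(infer_algorithm(probe))  # matches the slower-decay reference
```

In the real attack the features would come from actual training runs on the distilled data, and the classifier would be trained over many algorithm/architecture combinations rather than two hand-set curves.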
🛡️ Threat Analysis
The final stage of IRA uses a dual-network diffusion framework to reconstruct sensitive real training samples from synthetic distilled datasets — a direct model inversion / training data reconstruction attack with an explicit adversarial threat model.
The second stage of IRA explicitly performs membership inference — determining whether a given sample was in the real training dataset — using hidden-layer and final-layer outputs of a locally cloned model.
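A common baseline for this kind of membership test is confidence thresholding: samples seen during training tend to receive higher-confidence outputs than unseen ones. The sketch below illustrates that idea on fabricated logits; IRA's actual test additionally uses hidden-layer activations of the cloned model, which are omitted here:

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D logit vector."""
    e = np.exp(z - z.max())
    return e / e.sum()

def is_member(logits, threshold=0.9):
    """Flag a sample as a likely training member if the cloned model's
    top softmax probability exceeds the threshold. (Illustrative rule,
    not the paper's exact scoring function.)"""
    return bool(softmax(np.asarray(logits, dtype=float)).max() > threshold)

print(is_member([8.0, 1.0, 0.5]))  # confident prediction -> True (likely member)
print(is_member([1.2, 1.0, 0.9]))  # diffuse prediction -> False (likely non-member)
```

The threshold would in practice be calibrated on shadow data the adversary controls; the white-box clone produced by the first stage is what makes the richer hidden-layer variant of this test possible.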