Published on arXiv

2509.11159

Model Theft

OWASP ML Top 10 — ML05

Key Finding

MetaDFME outperforms the state-of-the-art data-free model extraction method on all tested datasets while producing substantially more stable substitute model accuracy across attack iterations.

MetaDFME

Novel technique introduced


Model extraction is a severe threat to Machine Learning-as-a-Service systems, especially through data-free approaches, where dishonest users can replicate the functionality of a black-box target model without access to realistic data. Despite recent advancements, existing data-free model extraction methods suffer from oscillating substitute model accuracy. This oscillation, which can be attributed to the constant shift in the generated data distribution during the attack, makes the attack impractical, since the optimal substitute model cannot be identified without access to the target model's in-distribution data. Hence, we propose MetaDFME, a novel data-free model extraction method that employs meta-learning in the generator training to reduce the distribution shift, aiming to mitigate the substitute model's accuracy oscillation. In detail, we train our generator to iteratively capture meta-representations of the synthetic data during the attack. These meta-representations can be adapted in a few steps to produce data that helps the substitute model learn from the target model while reducing the effect of distribution shifts. Our experiments on popular baseline image datasets, MNIST, SVHN, CIFAR-10, and CIFAR-100, demonstrate that MetaDFME outperforms the current state-of-the-art data-free model extraction method while exhibiting more stable substitute model accuracy during the attack.


Key Contributions

  • MetaDFME: a data-free model extraction method using meta-learning in generator training to minimize distribution shift across attack iterations
  • Two-loop (inner/outer) generator optimization that captures meta-representations of synthetic data, enabling stable substitute model accuracy without access to in-distribution data
  • Outperforms state-of-the-art DFME on CIFAR-10/100, MNIST, and SVHN while significantly reducing substitute model accuracy oscillation
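The two-loop structure above can be sketched in miniature. This is a hedged illustration only: the paper's actual generator architecture, losses, and update rules are not shown here, so the code below uses a Reptile-style meta-update on a toy parameter vector with a stand-in quadratic "task loss" to convey the inner-adapt / outer-update pattern.

```python
import numpy as np

def task_loss_grad(theta, task_center):
    # Gradient of a toy quadratic loss 0.5 * ||theta - task_center||^2,
    # standing in for the generator's per-iteration attack objective.
    return theta - task_center

def inner_adapt(theta, task_center, lr=0.1, steps=5):
    # Inner loop: adapt the meta-representation to the current synthetic
    # data distribution with only a few gradient steps.
    adapted = theta.copy()
    for _ in range(steps):
        adapted -= lr * task_loss_grad(adapted, task_center)
    return adapted

def meta_train(theta, task_centers, meta_lr=0.5):
    # Outer loop (Reptile-style update): move the meta-parameters toward
    # each task-adapted solution, so that future adaptation to a shifted
    # distribution needs only a few inner steps.
    for center in task_centers:
        adapted = inner_adapt(theta, center)
        theta = theta + meta_lr * (adapted - theta)
    return theta

rng = np.random.default_rng(0)
theta = rng.normal(size=3)
# Tasks drawn around a common mean, mimicking a slowly shifting
# generated-data distribution across attack iterations.
tasks = [np.ones(3) + 0.1 * rng.normal(size=3) for _ in range(20)]
theta = meta_train(theta, tasks)
```

After meta-training, `theta` sits near the shared structure of the tasks, so adapting to the next (shifted) task requires only a few inner steps rather than retraining from scratch, which is the intuition behind reducing accuracy oscillation across attack iterations.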

🛡️ Threat Analysis

Model Theft

Directly proposes a novel model extraction attack: cloning a black-box target model's functionality into an attacker-owned substitute model without access to real training data. The core contribution advances model theft via a meta-learning generator that improves the attack's practicality.


Details

Domains
vision
Model Types
cnn, generative
Threat Tags
black_box, inference_time, hard_label
Datasets
MNIST, SVHN, CIFAR-10, CIFAR-100
Applications
image classification, MLaaS