Published on arXiv

2509.11159

Model Theft

OWASP ML Top 10 — ML05

Key Finding

MetaDFME outperforms the state-of-the-art data-free model extraction method on all tested datasets while producing substantially more stable substitute model accuracy across attack iterations.

MetaDFME

Novel technique introduced


Model extraction is a severe threat to Machine Learning-as-a-Service systems, especially through data-free approaches, where dishonest users can replicate the functionality of a black-box target model without access to realistic data. Despite recent advancements, existing data-free model extraction methods suffer from oscillating substitute model accuracy. This oscillation, which can be attributed to the constant shift in the generated data distribution during the attack, makes the attack impractical, since the optimal substitute model cannot be identified without access to the target model's in-distribution data. Hence, we propose MetaDFME, a novel data-free model extraction method that employs meta-learning in the generator training to reduce the distribution shift, aiming to mitigate the substitute model's accuracy oscillation. In detail, we train our generator to iteratively capture meta-representations of the synthetic data during the attack. These meta-representations can be adapted in a few steps to produce data that helps the substitute model learn from the target model while reducing the effect of distribution shifts. Our experiments on popular baseline image datasets, MNIST, SVHN, CIFAR-10, and CIFAR-100, demonstrate that MetaDFME outperforms the current state-of-the-art data-free model extraction method while exhibiting more stable substitute model accuracy during the attack.


Key Contributions

  • MetaDFME: a data-free model extraction method using meta-learning in generator training to minimize distribution shift across attack iterations
  • Two-loop (inner/outer) generator optimization that captures meta-representations of synthetic data, enabling stable substitute model accuracy without access to in-distribution data
  • Outperforms state-of-the-art DFME on CIFAR-10/100, MNIST, and SVHN while significantly reducing substitute model accuracy oscillation
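The two-loop structure above can be sketched in miniature. This is a hedged illustration only: the paper's actual generator architecture, losses, and update rules are not shown here, so the code below uses a Reptile-style meta-update on a toy parameter vector with a stand-in quadratic "task loss" to convey the inner-adapt / outer-update pattern.

```python
import numpy as np

def task_loss_grad(theta, task_center):
    # Gradient of a toy quadratic loss 0.5 * ||theta - task_center||^2,
    # standing in for the generator's per-iteration attack objective.
    return theta - task_center

def inner_adapt(theta, task_center, lr=0.1, steps=5):
    # Inner loop: adapt the meta-representation to the current synthetic
    # data distribution with only a few gradient steps.
    adapted = theta.copy()
    for _ in range(steps):
        adapted -= lr * task_loss_grad(adapted, task_center)
    return adapted

def meta_train(theta, task_centers, meta_lr=0.5):
    # Outer loop (Reptile-style update): move the meta-parameters toward
    # each task-adapted solution, so that future adaptation to a shifted
    # distribution needs only a few inner steps.
    for center in task_centers:
        adapted = inner_adapt(theta, center)
        theta = theta + meta_lr * (adapted - theta)
    return theta

rng = np.random.default_rng(0)
theta = rng.normal(size=3)
# Tasks drawn around a common mean, mimicking a slowly shifting
# generated-data distribution across attack iterations.
tasks = [np.ones(3) + 0.1 * rng.normal(size=3) for _ in range(20)]
theta = meta_train(theta, tasks)
```

After meta-training, `theta` sits near the shared structure of the tasks, so adapting to the next (shifted) task requires only a few inner steps rather than retraining from scratch, which is the intuition behind reducing accuracy oscillation across attack iterations.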

🛡️ Threat Analysis

Model Theft

Directly proposes a novel model extraction attack: cloning a black-box target model's functionality into an attacker-owned substitute model without access to real training data. The core contribution advances model theft via a meta-learning generator that improves the attack's practicality.


Details

Domains
vision
Model Types
cnn, generative
Threat Tags
black_box, inference_time, hard_label
Datasets
MNIST, SVHN, CIFAR-10, CIFAR-100
Applications
image classification, MLaaS