
I Stolenly Swear That I Am Up to (No) Good: Design and Evaluation of Model Stealing Attacks

Daryna Oliynyk, Rudolf Mayer, Kathrin Grosse, Andreas Rauber


Published on arXiv: 2508.21654

Model Theft

OWASP ML Top 10 — ML05

Key Finding

Analysis reveals only a small fraction of prior model stealing attacks can be meaningfully compared, and roughly a quarter of attack configurations have been studied in only one or two works, exposing major evaluation gaps in the field.


Model stealing attacks endanger the confidentiality of machine learning models offered as a service. Although these models are kept secret, a malicious party can query a model to label data samples and train their own substitute model, violating intellectual property. While novel attacks in the field are continually being published, their design and evaluations are not standardised, making it challenging to compare prior works and assess progress in the field. This paper is the first to address this gap by providing recommendations for designing and evaluating model stealing attacks. To this end, we study the largest group of attacks that rely on training a substitute model -- those attacking image classification models. We propose the first comprehensive threat model and develop a framework for attack comparison. Further, we analyse attack setups from related works to understand which tasks and models have been studied the most. Based on our findings, we present best practices for attack development before, during, and beyond experiments and derive an extensive list of open research questions regarding the evaluation of model stealing attacks. Our findings and recommendations also transfer to other problem domains, hence establishing the first generic evaluation methodology for model stealing attacks.
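The substitute training attack described above can be sketched in a few lines: the attacker treats the victim as a label-only oracle, queries it on surrogate data, and fits their own model to the stolen labels. This is a minimal illustration, not the paper's method; the victim/substitute architectures, the `digits` dataset, and all function names here are assumptions chosen for a self-contained example.

```python
# Sketch of a black-box substitute training ("model stealing") attack.
# Assumptions: victim is an MLP image classifier exposed label-only (MLaaS
# setting); the attacker fits a logistic-regression substitute. All names
# and model choices are illustrative, not from the paper.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Train the victim model (hidden from the attacker).
X, y = load_digits(return_X_y=True)
X_train, X_attack, y_train, _ = train_test_split(
    X, y, test_size=0.5, random_state=0
)
victim = MLPClassifier(
    hidden_layer_sizes=(64,), max_iter=500, random_state=0
).fit(X_train, y_train)

def query_victim(samples):
    """Black-box oracle: the attacker only observes predicted labels."""
    return victim.predict(samples)

# Attacker: label surrogate data via the oracle, train a substitute.
stolen_labels = query_victim(X_attack)
substitute = LogisticRegression(max_iter=1000).fit(X_attack, stolen_labels)

# Fidelity: how often the substitute agrees with the victim's labels
# (here measured on the attacker's own query set, for simplicity).
agreement = float(
    (substitute.predict(X_attack) == stolen_labels).mean()
)
print(f"substitute/victim label agreement: {agreement:.2f}")
```

The paper's point is precisely that evaluations of this loop vary widely: which surrogate data is queried, how many queries are allowed, and which metric (accuracy vs. fidelity/agreement) is reported all differ across works, making results hard to compare.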


Key Contributions

  • First comprehensive threat model covering attacker knowledge, capabilities, and goals for model stealing attacks
  • Systematization of 40+ substitute training attacks on image classifiers along the proposed threat model dimensions
  • First generic evaluation framework and best-practices guide for comparable model stealing attack assessment, revealing that only a small fraction of prior attacks can be fairly compared

🛡️ Threat Analysis

Model Theft

Paper is entirely focused on model stealing (extraction) attacks — specifically substitute training attacks on image classifiers — proposing a threat model, attack comparison framework, and evaluation methodology for this class of IP theft attacks.


Details

  • Domains: vision
  • Model Types: cnn, transformer
  • Threat Tags: black_box, inference_time
  • Applications: image classification, machine learning as a service