benchmark 2025

MER-Inspector: Assessing model extraction risks from an attack-agnostic perspective

Xinwei Zhang 1, Haibo Hu 1, Qingqing Ye 1, Li Bai 1, Huadi Zheng 2

4 citations · 64 references · WWW


Published on arXiv: 2509.18578

Model Theft

OWASP ML Top 10 — ML05

Key Finding

MER-Inspector correctly compares the relative model extraction risk of any two architectures with up to 89.58% accuracy across 16 architectures and 5 datasets.

Novel techniques introduced

MER-Inspector (Model Extraction Risk Inspector) · Model Recovery Complexity (MRC)


Information leakage issues in machine learning-based Web applications have attracted increasing attention. While the risk of data privacy leakage has been rigorously analyzed, the theory of model function leakage, known as Model Extraction Attacks (MEAs), has not been well studied. In this paper, we are the first to understand MEAs theoretically from an attack-agnostic perspective and to propose analytical metrics for evaluating model extraction risks. By using the Neural Tangent Kernel (NTK) theory, we formulate the linearized MEA as a regularized kernel classification problem and then derive the fidelity gap and generalization error bounds of the attack performance. Based on these theoretical analyses, we propose a new theoretical metric called Model Recovery Complexity (MRC), which measures the distance of weight changes between the victim and surrogate models to quantify risk. Additionally, we find that victim model accuracy, which shows a strong positive correlation with model extraction risk, can serve as an empirical metric. By integrating these two metrics, we propose a framework, namely Model Extraction Risk Inspector (MER-Inspector), to compare the extraction risks of models under different model architectures by utilizing relative metric values. We conduct extensive experiments on 16 model architectures and 5 datasets. The experimental results demonstrate that the proposed metrics have a high correlation with model extraction risks, and MER-Inspector can accurately compare the extraction risks of any two models with up to 89.58% accuracy.
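The abstract describes MRC as a distance between the weight changes of the victim and surrogate models, and MER-Inspector as a relative comparison that combines MRC with victim model accuracy (VMA). The sketch below is a loose illustration of that idea only, not the paper's actual formulation: the exact MRC definition comes from the NTK linearization, and the voting rule here is a hypothetical simplification.

```python
import numpy as np

def model_recovery_complexity(victim_deltas, surrogate_deltas):
    """Illustrative proxy for MRC (simplified, not the paper's NTK-based
    definition): the L2 distance between the weight changes (final minus
    initial weights) of the victim and surrogate models. A smaller value
    suggests the surrogate recovers the victim's function more easily,
    i.e. higher extraction risk under this reading."""
    v = np.concatenate([d.ravel() for d in victim_deltas])
    s = np.concatenate([d.ravel() for d in surrogate_deltas])
    return float(np.linalg.norm(v - s))

def compare_extraction_risk(mrc_a, vma_a, mrc_b, vma_b):
    """Hypothetical MER-Inspector-style relative comparison: lower MRC and
    higher victim accuracy (VMA) both indicate higher extraction risk.
    Each metric casts one vote; ties go to the model with higher VMA."""
    votes_a = int(mrc_a < mrc_b) + int(vma_a > vma_b)
    votes_b = int(mrc_b < mrc_a) + int(vma_b > vma_a)
    if votes_a != votes_b:
        return "A" if votes_a > votes_b else "B"
    return "A" if vma_a >= vma_b else "B"
```

Because MER-Inspector only needs *relative* metric values, a comparison like this can rank two architectures by risk without simulating any specific attack.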


Key Contributions

  • First attack-agnostic theoretical analysis of MEAs using Neural Tangent Kernel (NTK) theory, yielding fidelity gap and generalization error bounds
  • Two complementary risk metrics: Model Recovery Complexity (MRC, theoretical) and Victim Model Accuracy (VMA, empirical), both shown to strongly correlate with actual extraction risk
  • MER-Inspector framework that integrates MRC and VMA to compare extraction risk between any two model architectures, validated across 16 architectures and 5 datasets with up to 89.58% accuracy

🛡️ Threat Analysis

Model Theft

The paper is entirely focused on Model Extraction Attacks (MEAs) — the mechanism by which adversaries steal model functionality via black-box queries. It derives theoretical bounds on attack performance (fidelity gap, generalization error) and proposes metrics (MRC, VMA) and a framework (MER-Inspector) specifically to assess how vulnerable a given model architecture is to extraction/theft.
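The black-box threat model the paper analyzes can be sketched as a query-then-train loop, with fidelity (agreement between surrogate and victim) as the quantity whose gap the paper bounds. All names below are hypothetical; this is a minimal sketch of the threat model, not the paper's attack.

```python
import numpy as np

def extract_surrogate(victim_predict, query_pool, fit_surrogate):
    """Black-box MEA sketch: the adversary sees only the victim's predicted
    labels for chosen queries (query_pool) and trains a surrogate on them.
    fit_surrogate is any training routine returning a predict function."""
    stolen_labels = victim_predict(query_pool)       # only observable signal
    return fit_surrogate(query_pool, stolen_labels)  # surrogate mimics victim

def fidelity(surrogate_predict, victim_predict, test_inputs):
    """Fraction of test points where the surrogate agrees with the victim."""
    return float(np.mean(surrogate_predict(test_inputs)
                         == victim_predict(test_inputs)))
```

An attack-agnostic risk metric such as MRC aims to predict how high this fidelity can get for a given victim architecture, without running the loop above.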


Details

Domains
vision
Model Types
cnn · transformer
Threat Tags
black_box · inference_time
Datasets
5 unspecified datasets (experiments on 16 model architectures)
Applications
image classification · ml-based web applications