attack 2025

SMOTE and Mirrors: Exposing Privacy Leakage from Synthetic Minority Oversampling

Georgi Ganev 1,2, Reza Nazari 1, Rees Davison 1, Amir Dizche 1, Xinmin Wu 1, Ralph Abbey 1, Jorge Silva 1, Emiliano De Cristofaro 3

2 citations · 63 references · arXiv

Published on arXiv

2510.15083

Model Inversion Attack

OWASP ML Top 10 — ML03

Membership Inference Attack

OWASP ML Top 10 — ML04

Key Finding

ReconSMOTE reconstructs real minority records with perfect precision and recall approaching 1.0, while DistinSMOTE distinguishes real from synthetic records perfectly (1.00 ± 0.00), versus 0.01 ± 0.01 for naive evaluation baselines.

DistinSMOTE / ReconSMOTE

Novel technique introduced


The Synthetic Minority Over-sampling Technique (SMOTE) is one of the most widely used methods for addressing class imbalance and generating synthetic data. Despite its popularity, little attention has been paid to its privacy implications; yet, it is used in the wild in many privacy-sensitive applications. In this work, we conduct the first systematic study of privacy leakage in SMOTE: We begin by showing that prevailing evaluation practices, i.e., naive distinguishing and distance-to-closest-record metrics, completely fail to detect any leakage and that membership inference attacks (MIAs) can be instantiated with high accuracy. Then, by exploiting SMOTE's geometric properties, we build two novel attacks with very limited assumptions: DistinSMOTE, which perfectly distinguishes real from synthetic records in augmented datasets, and ReconSMOTE, which reconstructs real minority records from synthetic datasets with perfect precision and recall approaching one under realistic imbalance ratios. We also provide theoretical guarantees for both attacks. Experiments on eight standard imbalanced datasets confirm the practicality and effectiveness of these attacks. Overall, our work reveals that SMOTE is inherently non-private and disproportionately exposes minority records, highlighting the need to reconsider its use in privacy-sensitive applications.
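
For context, the "geometric properties" mentioned in the abstract refer to how SMOTE generates records. The standard SMOTE step, which the abstract does not restate, can be written as follows; the symbols $x_i$, $x_j$, $\lambda$, and $\tilde{x}$ are notation chosen here for illustration, not necessarily the paper's.

```latex
\tilde{x} = x_i + \lambda \, (x_j - x_i), \qquad \lambda \sim \mathcal{U}(0, 1),
```

Here $x_i$ is a real minority record and $x_j$ is one of its $k$ nearest minority-class neighbours. Every synthetic record $\tilde{x}$ therefore lies on the line segment between two real minority records, which is the structure both attacks are described as exploiting.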


Key Contributions

  • DistinSMOTE: a novel attack that perfectly distinguishes real from synthetic records in SMOTE-augmented datasets by exploiting SMOTE's geometric interpolation properties, with theoretical guarantees
  • ReconSMOTE: a novel attack that reconstructs real minority training records from purely synthetic SMOTE datasets with perfect precision and recall approaching 1.0 under realistic imbalance ratios
  • First systematic privacy study of SMOTE, showing that standard evaluation metrics (naive distinguishing and distance to closest record, DCR) completely fail to detect leakage, while MIAs achieve 0.93 AUC on synthetic data
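
To make the interpolation geometry behind the first contribution concrete, here is a minimal toy sketch, not the paper's DistinSMOTE algorithm. It assumes 2D numeric records and flags any record lying strictly between two other records of the augmented dataset. The function names (`interpolate`, `lies_strictly_between`, `flag_synthetic`) and the toy data are hypothetical.

```python
import math

def interpolate(x_i, x_j, lam):
    """SMOTE-style synthetic point on the segment from x_i to x_j (0 < lam < 1)."""
    return tuple(a + lam * (b - a) for a, b in zip(x_i, x_j))

def lies_strictly_between(p, a, b, tol=1e-9):
    """True if p is on the open segment between a and b (within tolerance)."""
    d_ap, d_pb, d_ab = math.dist(a, p), math.dist(p, b), math.dist(a, b)
    return d_ap > tol and d_pb > tol and abs(d_ap + d_pb - d_ab) <= tol

def flag_synthetic(points):
    """Flag each point that lies strictly between two other points in the dataset."""
    flags = []
    for i, p in enumerate(points):
        others = points[:i] + points[i + 1:]
        flags.append(any(
            lies_strictly_between(p, a, b)
            for k, a in enumerate(others) for b in others[k + 1:]
        ))
    return flags

# Hypothetical toy data: four real minority records and four interpolated ones.
real = [(0.0, 0.0), (4.0, 0.0), (0.0, 3.0), (5.0, 4.0)]
synthetic = [
    interpolate(real[0], real[1], 0.5),
    interpolate(real[0], real[2], 0.25),
    interpolate(real[1], real[3], 0.5),
    interpolate(real[2], real[3], 0.5),
]
flags = flag_synthetic(real + synthetic)  # one boolean per record, real records first
```

In this toy setup the real records are the corners of the convex hull, so none of them can lie between two other records. On real data, chance collinearity, discrete features, and duplicates would need more careful handling than this brute-force check provides.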

🛡️ Threat Analysis

Model Inversion Attack

ReconSMOTE reconstructs real minority training records from SMOTE-generated synthetic datasets with perfect precision and recall approaching one, making it a direct data reconstruction attack that exploits the geometry of the generative process.
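
One way to see why interpolation geometry can leak real records is the following toy sketch, which is an illustration of the general idea only and not the authors' ReconSMOTE procedure. It assumes 2D records with exact rational coordinates, finds lines supported by at least three synthetic points, and intersects them. All function names and the data are hypothetical.

```python
from fractions import Fraction
from itertools import combinations

def line_through(p, q):
    """Canonical (a, b, c) with a*x + b*y = c for the line through distinct p and q."""
    a = q[1] - p[1]
    b = p[0] - q[0]
    c = a * p[0] + b * p[1]
    lead = a if a != 0 else b  # scale so the first non-zero coefficient is 1
    return (a / lead, b / lead, c / lead)

def supported_lines(points, min_points=3):
    """Lines passing through at least `min_points` of the given points."""
    support = {}
    for p, q in combinations(points, 2):
        support.setdefault(line_through(p, q), set()).update([p, q])
    return [line for line, pts in support.items() if len(pts) >= min_points]

def intersections(lines):
    """Intersection points of every pair of non-parallel lines (Cramer's rule)."""
    found = set()
    for (a1, b1, c1), (a2, b2, c2) in combinations(lines, 2):
        det = a1 * b2 - a2 * b1
        if det != 0:
            found.add(((c1 * b2 - c2 * b1) / det, (a1 * c2 - a2 * c1) / det))
    return found

def pt(x, y):
    """Build a point with exact rational coordinates."""
    return (Fraction(x), Fraction(y))

# Hypothetical synthetic-only dataset: three points on each side of the
# triangle with hidden real corners (0, 0), (4, 0), and (0, 4).
synthetic = [pt(1, 0), pt(2, 0), pt(3, 0),
             pt(0, 1), pt(0, 2), pt(0, 3),
             pt(1, 3), pt(2, 2), pt(3, 1)]
candidates = intersections(supported_lines(synthetic))
```

In this toy case the three supported lines meet exactly at the hidden corners. With more records, lines that are not neighbours in the SMOTE sense also intersect, so candidate filtering would be needed. Floating-point data would additionally require tolerance-based line grouping instead of exact fractions.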

Membership Inference Attack

The paper instantiates membership inference attacks against SMOTE and proposes DistinSMOTE, which perfectly distinguishes real from synthetic records in augmented datasets (1.00 ± 0.00) and thereby reveals which records are real training data.


Details

Domains
tabular
Model Types
traditional_ml, generative
Threat Tags
black_box, training_time, targeted
Datasets
8 standard imbalanced tabular datasets
Applications
tabular synthetic data generation, class imbalance oversampling, medical data augmentation, fraud detection