
Synth-MIA: A Testbed for Auditing Privacy Leakage in Tabular Data Synthesis

Joshua Ward, Xiaofeng Lin, Chi-Hua Wang, Guang Cheng

6 citations · 53 references · arXiv


Published on arXiv: 2509.18014

Membership Inference Attack

OWASP ML Top 10 — ML04

Key Finding

Higher synthetic data quality corresponds to greater privacy leakage; similarity-based metrics correlate only weakly with MIA results; and the differentially private generator PATEGAN can still fail to preserve privacy under MIA-based auditing.

Synth-MIA

Novel technique introduced


Tabular Generative Models are often argued to preserve privacy by creating synthetic datasets that resemble training data. However, auditing their empirical privacy remains challenging, as commonly used similarity metrics fail to effectively characterize privacy risk. Membership Inference Attacks (MIAs) have recently emerged as a method for evaluating privacy leakage in synthetic data, but their practical effectiveness is limited. Numerous attacks exist across different threat models, each with distinct implementations targeting various sources of privacy leakage, making them difficult to apply consistently. Moreover, no single attack consistently outperforms the others, leading to routine underestimation of privacy risk. To address these issues, we propose a unified, model-agnostic threat framework that deploys a collection of attacks to estimate the maximum empirical privacy leakage in synthetic datasets. We introduce Synth-MIA, an open-source Python library that streamlines this auditing process through a novel testbed that integrates seamlessly into existing synthetic data evaluation pipelines via a Scikit-Learn-like API. Our software implements 13 attack methods, enabling fast, systematic estimation of privacy leakage for practitioners and facilitating the development of new attacks and experiments for researchers. We demonstrate our framework's utility in the largest tabular synthesis privacy benchmark to date, revealing that higher synthetic data quality corresponds to greater privacy leakage, that similarity-based privacy metrics show weak correlation with MIA results, and that the differentially private generator PATEGAN can fail to preserve privacy under such attacks. This underscores the necessity of MIA-based auditing when designing and deploying Tabular Generative Models.


Key Contributions

  • Synth-MIA: an open-source Python library with scikit-learn-like API implementing 13 MIA methods for auditing tabular generative models
  • A unified, model-agnostic threat framework that deploys an ensemble of attacks to estimate maximum empirical privacy leakage in synthetic datasets
  • Largest tabular synthesis privacy benchmark to date, revealing that higher synthetic data quality correlates with greater privacy leakage and that similarity-based metrics poorly capture MIA risk
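
The scikit-learn-like attack interface described above can be illustrated with a minimal sketch. The class and method names here (`DCRAttack`, `fit`, `score`) are hypothetical stand-ins, not Synth-MIA's actual API; the attack shown is a simple distance-to-closest-record (DCR) heuristic, one common black-box MIA signal for synthetic data.

```python
import numpy as np

class DCRAttack:
    """Toy black-box MIA with a fit/score interface in the scikit-learn style.
    Records closer to the synthetic sample are scored as more likely members.
    Hypothetical sketch, not Synth-MIA's actual interface."""

    def fit(self, synthetic):
        self.synthetic = np.asarray(synthetic, dtype=float)
        return self

    def score(self, targets):
        targets = np.asarray(targets, dtype=float)
        # distance from each target record to its closest synthetic record;
        # negate so that higher score = more likely a training member
        dists = np.linalg.norm(
            targets[:, None, :] - self.synthetic[None, :, :], axis=-1
        )
        return -dists.min(axis=1)

rng = np.random.default_rng(0)
train = rng.normal(size=(50, 4))
# a deliberately leaky "generator": synthetic records are noisy copies of training data
synthetic = train + rng.normal(scale=0.1, size=train.shape)
holdout = rng.normal(size=(50, 4))

attack = DCRAttack().fit(synthetic)
member_scores = attack.score(train)
nonmember_scores = attack.score(holdout)
```

With a leaky generator like the one simulated here, member scores separate cleanly from non-member scores, which is exactly the signal an MIA audit quantifies.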

🛡️ Threat Analysis

Membership Inference Attack

The entire paper is about membership inference attacks (MIAs) on tabular generative models — implementing 13 MIA methods, proposing a unified framework to estimate maximum empirical privacy leakage, and benchmarking MIA performance across multiple tabular synthesizers including PATEGAN.
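
The "maximum empirical privacy leakage" idea behind the unified framework can be sketched as follows: run several attack scores over the same member/non-member split and report the worst-case (highest) AUC. The two scoring functions below are simplified stand-ins for the 13 implemented attacks, and the AUC is computed with the rank-based Mann-Whitney formulation; none of this reproduces Synth-MIA's actual implementation.

```python
import numpy as np

def auc(member_scores, nonmember_scores):
    """Rank-based AUC (Mann-Whitney U divided by n_pos * n_neg)."""
    scores = np.concatenate([member_scores, nonmember_scores])
    ranks = scores.argsort().argsort() + 1  # 1-based ranks; ties broken arbitrarily
    n_pos, n_neg = len(member_scores), len(nonmember_scores)
    u = ranks[:n_pos].sum() - n_pos * (n_pos + 1) / 2
    return u / (n_pos * n_neg)

def dcr_score(targets, synthetic):
    # negative distance to the closest synthetic record
    d = np.linalg.norm(targets[:, None] - synthetic[None, :], axis=-1)
    return -d.min(axis=1)

def density_score(targets, synthetic, bandwidth=0.5):
    # crude kernel-density estimate of each target under the synthetic sample
    d = np.linalg.norm(targets[:, None] - synthetic[None, :], axis=-1)
    return np.exp(-(d / bandwidth) ** 2).mean(axis=1)

rng = np.random.default_rng(1)
train = rng.normal(size=(100, 3))
synth = train + rng.normal(scale=0.05, size=train.shape)  # overfit generator
holdout = rng.normal(size=(100, 3))

attacks = {"dcr": dcr_score, "density": density_score}
aucs = {name: auc(f(train, synth), f(holdout, synth)) for name, f in attacks.items()}
max_leakage = max(aucs.values())  # worst-case empirical leakage across attacks
```

Because no single attack dominates, reporting the maximum AUC across the ensemble gives a conservative leakage estimate, which is the auditing stance the paper argues for.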


Details

Domains
tabular, generative
Model Types
gan, diffusion, traditional_ml
Threat Tags
black_box, inference_time
Datasets
Healthcare, Finance, Education tabular datasets
Applications
tabular data synthesis, synthetic data generation, privacy auditing