Survey · 2025

Synthetic Data Privacy Metrics

Amy Steier, Lipika Ramaswamy, Andre Manoel, Alexa Haushalter

Published on arXiv: 2501.03941

Membership Inference Attack (OWASP ML Top 10 — ML04)

Model Inversion Attack (OWASP ML Top 10 — ML03)

Key Finding

No standardized empirical privacy metric exists for synthetic data; adversarial simulation metrics (MIA, AIA) are among the most rigorous but must be complemented by distance-based metrics given the privacy-utility trade-off.


Recent advances in generative AI have made it possible to create synthetic datasets that can match real-world data in accuracy for training AI models, powering statistical insights, and enabling collaboration on sensitive datasets, while offering strong privacy guarantees. Effectively measuring the empirical privacy of synthetic data is an important step in this process. However, while a multitude of new privacy metrics is published every day, there is currently no standardization. In this paper, we review the pros and cons of popular metrics that include simulations of adversarial attacks. We also review current best practices for amending generative models to enhance the privacy of the data they create (e.g., differential privacy).


Key Contributions

  • Reviews and compares popular synthetic data privacy metrics (k-anonymity, DCR, NNDR, NNAA, exact match, PII replay) with emphasis on adversarial attack-based simulation metrics
  • Surveys membership inference and attribute inference attack methodologies as empirical privacy evaluation tools for synthetic tabular, text, and image data
  • Reviews defenses and best practices for enhancing synthetic data privacy including differential privacy and PII scrubbing
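Two of the distance-based metrics named above, DCR and NNDR, can be illustrated with a minimal sketch. The function names `dcr` and `nndr` are hypothetical, and Euclidean distance over numeric features is an assumption; real implementations must also handle categorical columns, scaling, and holdout-based calibration.

```python
import numpy as np

def dcr(real: np.ndarray, synthetic: np.ndarray) -> np.ndarray:
    """Distance to Closest Record: for each synthetic row, the Euclidean
    distance to its nearest real row. Values near zero hint at copying."""
    # Pairwise distances via broadcasting, shape (n_synth, n_real)
    d = np.linalg.norm(synthetic[:, None, :] - real[None, :, :], axis=-1)
    return d.min(axis=1)

def nndr(real: np.ndarray, synthetic: np.ndarray) -> np.ndarray:
    """Nearest-Neighbor Distance Ratio: nearest / second-nearest real-record
    distance per synthetic row. Ratios near 0 flag near-copies of outliers."""
    d = np.linalg.norm(synthetic[:, None, :] - real[None, :, :], axis=-1)
    d.sort(axis=1)  # sort each row so columns 0 and 1 are the two nearest
    return d[:, 0] / d[:, 1]

rng = np.random.default_rng(0)
real = rng.normal(size=(200, 4))   # stand-in for the real table
synth = rng.normal(size=(100, 4))  # stand-in for generator output
print(dcr(real, synth).mean(), nndr(real, synth).mean())
```

Both metrics are typically compared against the same statistics computed on a real holdout set, rather than judged against a fixed absolute threshold.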

🛡️ Threat Analysis

Model Inversion Attack

The survey reviews attribute inference attacks (AIAs), which recover private attributes of individuals from model outputs, along with memorization and PII leakage from generative models; both map to model inversion and training-data reconstruction threats.
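A toy version of such an attack, under the assumption that the adversary knows a victim's quasi-identifiers and looks up the closest synthetic record; the 1-nearest-neighbor strategy and the name `aia_predict` are illustrative, not the paper's specific method:

```python
import numpy as np

def aia_predict(synth_quasi: np.ndarray,
                synth_sensitive: np.ndarray,
                target_quasi: np.ndarray) -> np.ndarray:
    """Attribute inference via 1-NN: for each target's known quasi-identifiers,
    return the sensitive value of the closest synthetic row."""
    # Distances from each target to every synthetic row, shape (n_targets, n_synth)
    d = np.linalg.norm(target_quasi[:, None, :] - synth_quasi[None, :, :], axis=-1)
    return synth_sensitive[d.argmin(axis=1)]

rng = np.random.default_rng(1)
synth_quasi = rng.normal(size=(500, 3))         # e.g. age, region, income (scaled)
synth_sensitive = rng.integers(0, 2, size=500)  # e.g. a binary diagnosis flag
targets = rng.normal(size=(10, 3))              # victims' known attributes
print(aia_predict(synth_quasi, synth_sensitive, targets))
```

The attack's accuracy above a naive baseline (e.g. predicting the majority class) is what the resulting privacy metric reports.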

Membership Inference Attack

Membership inference attacks (MIAs) are explicitly reviewed as a primary privacy metric category for evaluating whether training data records can be identified in synthetic data — directly addressing the ML04 threat.
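One common black-box formulation scores a candidate record by its distance to the nearest synthetic record and guesses "member" when that distance is small. The sketch below is a simplified illustration of this family of attacks, not the paper's specific construction; the deliberately leaky generator exists only to make the signal visible.

```python
import numpy as np

def mia_score(candidates: np.ndarray, synthetic: np.ndarray) -> np.ndarray:
    """Membership score: negative distance to the nearest synthetic record.
    Higher score => candidate more likely to have been in the training set."""
    d = np.linalg.norm(candidates[:, None, :] - synthetic[None, :, :], axis=-1)
    return -d.min(axis=1)

rng = np.random.default_rng(2)
train = rng.normal(size=(300, 4))
# Deliberately leaky "generator": synthetic rows are training rows plus noise
synth = train + 0.05 * rng.normal(size=train.shape)
non_members = rng.normal(size=(300, 4))

members_score = mia_score(train, synth)
outsiders_score = mia_score(non_members, synth)
# On a leaky generator, members score clearly higher on average
print(members_score.mean() > outsiders_score.mean())
```

In practice the scores for known members and known non-members are summarized as an ROC curve or attack accuracy, which becomes the reported privacy metric.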


Details

Domains
tabular, nlp, generative
Model Types
generative, gan
Threat Tags
black_box, inference_time
Applications
synthetic data generation, privacy-preserving data sharing, tabular data synthesis