Generative clinical time series models trained on moderate amounts of patient data are privacy preserving
Rustam Zhumagambetov 1, Niklas Giesa 2, Sebastian D. Boie 2, Stefan Haufe 3,1,2
Published on arXiv: 2602.10631
Membership Inference Attack (OWASP ML Top 10 — ML04)
Model Inversion Attack (OWASP ML Top 10 — ML03)
Key Finding
Established privacy attacks (membership inference, memorization-based) are ineffective against multivariate clinical time series generative models trained on sufficiently large datasets; for these models, differential privacy would not meaningfully improve privacy and would only reduce the utility of the synthetic data.
Sharing medical data for machine learning model training purposes is often impossible due to the risk of disclosing identifying information about individual patients. Synthetic data produced by generative artificial intelligence (genAI) models trained on real data is often seen as one possible way to comply with privacy regulations. While powerful genAI models for heterogeneous hospital time series have recently been introduced, such modeling does not guarantee privacy protection, as the generated data may still reveal identifying information about individuals in the models' training cohort. Applying established privacy mechanisms to generative time series models, however, proves challenging: post-hoc data anonymization through k-anonymization or similar techniques is of limited effectiveness, while model-centered privacy mechanisms that implement differential privacy (DP) may lead to unstable training, compromising the utility of the generated data. Given these known limitations, privacy audits for generative time series models are currently indispensable regardless of the concrete privacy mechanisms applied to models and/or data. In this work, we use a battery of established privacy attacks to audit state-of-the-art hospital time series models, trained on the public MIMIC-IV dataset, with respect to privacy preservation. Furthermore, the eICU dataset was used to mount a privacy attack against the synthetic data generator trained on the MIMIC-IV dataset. Results show that established privacy attacks are ineffective against generated multivariate clinical time series when synthetic data generators are trained on large enough training datasets. Furthermore, we discuss how the use of existing DP mechanisms for these synthetic data generators would not bring the desired improvement in privacy, but only a decrease in utility for machine learning prediction tasks.
Key Contributions
- Systematic privacy audit of state-of-the-art hospital time series generative models (trained on MIMIC-IV) using a battery of established privacy attacks including membership inference
- Empirical demonstration that privacy attacks are ineffective when synthetic data generators are trained on sufficiently large patient datasets
- Analysis showing that differential privacy mechanisms would not meaningfully improve privacy for these models but would degrade utility for downstream ML prediction tasks
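To make the privacy–utility trade-off of the last point concrete, here is a minimal numpy sketch of the core DP-SGD update step (per-sample gradient clipping plus Gaussian noise, in the style of Abadi et al.). This is an illustration of the generic mechanism the paper argues against for these generators, not the authors' training code; the function name and parameters are hypothetical.

```python
import numpy as np

def dp_sgd_step(per_sample_grads, clip_norm=1.0, noise_multiplier=1.1, rng=None):
    """One DP-SGD update: clip each per-sample gradient to clip_norm,
    average the clipped gradients, then add calibrated Gaussian noise.

    A larger noise_multiplier gives a stronger privacy guarantee but
    perturbs the update more, which is the source of the utility loss.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    clipped = []
    for g in per_sample_grads:
        norm = np.linalg.norm(g)
        # Scale down any gradient whose L2 norm exceeds the clipping bound.
        clipped.append(g * min(1.0, clip_norm / max(norm, 1e-12)))
    mean_grad = np.mean(clipped, axis=0)
    # Noise standard deviation scales with the clipping bound and shrinks
    # with the batch size.
    noise = rng.normal(0.0, noise_multiplier * clip_norm / len(per_sample_grads),
                       size=mean_grad.shape)
    return mean_grad + noise
```

With `noise_multiplier=0` this reduces to plain clipped averaging; increasing it injects ever more noise into every update, which is why DP training of generative time series models can become unstable.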
🛡️ Threat Analysis
The paper explicitly mounts membership inference attacks against generative time series models to determine whether individual patient records can be identified as members of the training cohort — the classic ML04 threat model applied to generative AI.
It also addresses dataset memorization and reconstruction risks (the generated data revealing identifying information about training individuals), which correspond to model inversion / training data reconstruction (ML03) — the adversary tries to recover private patient records from model outputs.
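As a rough illustration of how such an audit can score candidates, here is a distance-to-closest-record membership heuristic: each candidate record (e.g. a flattened patient time series) is scored by its proximity to the nearest synthetic sample, on the assumption that memorized training members sit closer to the generated data than non-members. This is a simplified sketch, not the paper's attack suite; the function name is illustrative.

```python
import numpy as np

def mia_scores(candidates, synthetic):
    """Score each candidate row by the negated Euclidean distance to its
    nearest synthetic sample. Higher score = closer to the synthetic data
    = more likely a training member, under the memorization assumption.

    candidates: (n, d) array of flattened candidate records
    synthetic:  (m, d) array of flattened synthetic records
    """
    # Pairwise differences via broadcasting: (n, m, d)
    diffs = candidates[:, None, :] - synthetic[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)          # (n, m) distance matrix
    return -dists.min(axis=1)                       # nearest-neighbour score
```

An auditor would threshold these scores and measure attack accuracy against known member/non-member labels; the paper's finding is that, with large enough training sets, such attacks do no better than chance.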