Secure Cross-Silo Synthetic Genomic Data Generation
Daniil Filienko, Martine De Cock, Sikha Pentyala
Published on arXiv
2604.27456
Model Inversion Attack
OWASP ML Top 10 — ML03
Key Finding
Enables high-utility synthetic genomic data generation across federated institutions while protecting patient privacy through MPC input privacy and DP output privacy
Access to genomic data is highly regulated due to its sensitive nature. While safeguards are essential, cumbersome data access processes pose a significant barrier to the development of AI methods for genomics. Synthetic data generation can mitigate this tension by enabling broader data sharing without exposing sensitive information. Synthetic genomic data are produced by training generative models on real data and subsequently sampling artificial data that preserves relevant statistics while limiting disclosures about the underlying individuals. In some settings, a single data holder may have sufficient data to train such generative models; however, in many applications data must be combined across multiple sites to achieve adequate scale. This need arises, e.g., in rare disease studies, where individual hospitals typically hold data for only a small number of patients. The solution we present in this paper enables multiple data holders to jointly train a synthetic data generator without revealing their raw data. Our approach combines secure multiparty computation (MPC) to ensure input privacy, so that no party ever discloses its data in unencrypted form, with differential privacy (DP) to provide output privacy by mitigating information leakage from the released synthetic data. We empirically demonstrate the effectiveness of the proposed method by generating high-utility synthetic datasets from multiple real RNA-seq cohorts in federated settings, showing that our approach enables privacy-preserving data synthesis even when data are distributed across institutions.
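The input-privacy idea can be illustrated with additive secret sharing, one common MPC building block. The sketch below is only a toy and is not the protocol used in the paper: the modulus `PRIME`, the three-server setup, the function names, and the per-site counts are all assumptions made for illustration. Each data holder splits its private value into random shares, each computing server adds up the shares it receives, and only the combined result reveals the aggregate.

```python
import secrets

# Illustrative field modulus (an assumption, not taken from the paper).
PRIME = 2**61 - 1


def share(value, n_parties):
    """Split a non-negative integer into n additive shares modulo PRIME."""
    shares = [secrets.randbelow(PRIME) for _ in range(n_parties - 1)]
    last = (value - sum(shares)) % PRIME
    return shares + [last]


def reconstruct(shares):
    """Recombine additive shares into the value they encode."""
    return sum(shares) % PRIME


def secure_sum(private_values, n_servers=3):
    """Sum private inputs so that no single server sees any raw input."""
    per_server = [[] for _ in range(n_servers)]
    # Each data holder sends exactly one share to each computing server.
    for value in private_values:
        for server, s in zip(per_server, share(value, n_servers)):
            server.append(s)
    # Each server adds the shares it holds; a single partial sum looks random.
    partial_sums = [sum(server) % PRIME for server in per_server]
    # Only the combination of all partial sums reveals the aggregate.
    return reconstruct(partial_sums)


# Hypothetical per-site counts for a single gene.
hospital_counts = [12, 7, 30]
total = secure_sum(hospital_counts)
print(total)
```

Any single share, or any single server's partial sum, is uniformly distributed modulo `PRIME`, so it carries no information about an individual site's input on its own. Training a generative model under MPC requires many more operations than addition, which this sketch does not attempt to cover.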
Key Contributions
- Combines secure multiparty computation (MPC) with differential privacy (DP) to enable federated training of synthetic genomic data generators without exposing raw data
- Demonstrates practical privacy-preserving synthetic RNA-seq data generation across distributed institutions
- Empirically validates the approach on real RNA-seq cohorts, showing high-utility synthetic data generation in federated settings
🛡️ Threat Analysis
The paper explicitly addresses protecting private training data (genomic data) from exposure. It uses MPC to ensure input privacy (no party discloses its data in unencrypted form) and differential privacy to limit information leakage from the released synthetic data. The threat model involves an adversary trying to extract sensitive genomic information from either the training process or the released synthetic data. The combination of MPC and DP directly defends against data reconstruction attacks.
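To make the output-privacy side concrete, here is a minimal sketch of the classic Gaussian mechanism applied to a clipped sum. It is not the paper's training procedure: the function names, the clipping bound, and the privacy parameters are hypothetical, and the noise calibration used is the textbook one, which holds only for 0 < epsilon < 1. The sensitivity is taken to be the clipping bound, assuming neighbouring datasets differ by adding or removing one record.

```python
import math
import random


def gaussian_sigma(sensitivity, epsilon, delta):
    """Noise scale of the classic Gaussian mechanism (requires 0 < epsilon < 1)."""
    if not 0 < epsilon < 1:
        raise ValueError("this calibration assumes 0 < epsilon < 1")
    if not 0 < delta < 1:
        raise ValueError("delta must lie in (0, 1)")
    return sensitivity * math.sqrt(2 * math.log(1.25 / delta)) / epsilon


def dp_clipped_sum(values, clip, epsilon, delta, rng):
    """Release a noisy sum after clipping each value to [0, clip]."""
    # Clipping bounds how much one record can change the sum.
    clipped = [min(max(v, 0.0), clip) for v in values]
    sigma = gaussian_sigma(clip, epsilon, delta)
    return sum(clipped) + rng.gauss(0.0, sigma)


# Hypothetical expression values and privacy parameters.
rng = random.Random(0)
noisy = dp_clipped_sum([3.2, 8.1, 15.0], clip=10.0, epsilon=0.5, delta=1e-5, rng=rng)
print(noisy)
```

In a combined MPC and DP pipeline, noise of this kind would be added inside the secure computation, so that no party ever sees the un-noised aggregate. That design point is stated here as a general pattern rather than as a description of the authors' implementation.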