defense 2025

RareGraph-Synth: Knowledge-Guided Diffusion Models for Generating Privacy-Preserving Synthetic Patient Trajectories in Ultra-Rare Diseases

Khartik Uppalapati, Shakeel Abdulkareem, Bora Yimenicioglu

0 citations · 29 references · arXiv

Published on arXiv

2510.06267

Membership Inference Attack

OWASP ML Top 10 — ML04

Key Finding

DOMIAS membership inference AUROC ≈0.53 on RareGraph-Synth synthetic EHR data, below the 0.55 safe-release threshold and substantially lower than non-KG baselines (AUROC ≈0.61±0.03)

RareGraph-Synth

Novel technique introduced


We propose RareGraph-Synth, a knowledge-guided, continuous-time diffusion framework that generates realistic yet privacy-preserving synthetic electronic-health-record (EHR) trajectories for ultra-rare diseases. RareGraph-Synth unifies five public resources into a heterogeneous knowledge graph (KG) comprising approximately 8 million typed edges: Orphanet/Orphadata, the Human Phenotype Ontology (HPO), the GARD rare-disease KG, PrimeKG, and the FDA Adverse Event Reporting System (FAERS). Meta-path scores extracted from this KG modulate the per-token noise schedule in the forward stochastic differential equation, steering generation toward biologically plausible lab-medication-adverse-event co-occurrences while retaining the stability of score-based diffusion models. The reverse denoiser then produces timestamped sequences of (lab code, medication code, adverse-event flag) triples that contain no protected health information. On simulated ultra-rare-disease cohorts, RareGraph-Synth lowers categorical Maximum Mean Discrepancy by 40% relative to an unguided diffusion baseline and by more than 60% relative to GAN counterparts, without sacrificing downstream predictive utility. A black-box membership-inference evaluation using the DOMIAS attacker yields an AUROC of approximately 0.53, below the 0.55 safe-release threshold and substantially lower than the approximately 0.61 ± 0.03 observed for non-KG baselines, demonstrating strong resistance to re-identification. These results suggest that integrating biomedical knowledge graphs directly into diffusion noise schedules can simultaneously enhance fidelity and privacy, enabling safer data sharing for rare-disease research.
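The abstract does not state the exact functional form by which meta-path scores modulate the noise schedule. The following is only a minimal sketch of the general idea, assuming a standard variance-preserving SDE with a linear schedule and a hypothetical rule in which a higher meta-path score (in [0, 1]) shrinks that token's noise rate. The names `token_scales`, `perturb`, and `strength`, and the default `beta_min`/`beta_max` values, are illustrative assumptions and are not taken from the paper.

```python
import numpy as np

def integrated_beta(t, beta_min=0.1, beta_max=20.0):
    """Integral from 0 to t of the linear schedule beta(s) = beta_min + s * (beta_max - beta_min)."""
    return beta_min * t + 0.5 * (beta_max - beta_min) * t ** 2

def token_scales(meta_path_scores, strength=0.5):
    """Map meta-path scores in [0, 1] to per-token noise multipliers.

    Assumed rule (not from the paper): more KG-plausible tokens get less noise.
    """
    s = np.clip(np.asarray(meta_path_scores, dtype=float), 0.0, 1.0)
    return 1.0 - strength * s

def perturb(x0, t, meta_path_scores, rng, strength=0.5):
    """Sample x_t from the VP perturbation kernel with a per-token schedule.

    x0 has shape (tokens, dim); returns (x_t, per-token noise std of shape (tokens, 1)).
    """
    scales = token_scales(meta_path_scores, strength)[:, None]
    log_alpha = -0.5 * scales * integrated_beta(t)   # log of the mean coefficient
    mean = np.exp(log_alpha) * x0
    std = np.sqrt(1.0 - np.exp(2.0 * log_alpha))
    return mean + std * rng.standard_normal(x0.shape), std

rng = np.random.default_rng(0)
x0 = rng.standard_normal((3, 4))                     # 3 token embeddings of dimension 4
xt, std = perturb(x0, t=0.5, meta_path_scores=[0.0, 0.5, 1.0], rng=rng)
```

Under this assumed rule, the token with score 1.0 receives the smallest noise standard deviation at any fixed `t`, so its embedding stays closer to the clean value for longer during the forward process.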


Key Contributions

  • RareGraph-Synth: a continuous-time diffusion framework that conditions the per-token noise schedule on meta-path scores extracted from an 8M-edge heterogeneous biomedical knowledge graph (Orphanet, HPO, GARD, PrimeKG, FAERS)
  • Achieves AUROC ≈0.53 against the DOMIAS membership inference attacker, below the 0.55 safe-release threshold and lower (more private) than non-KG baselines (AUROC ≈0.61±0.03)
  • Reduces categorical Maximum Mean Discrepancy by 40% vs. unguided diffusion and >60% vs. GAN baselines while preserving downstream predictive utility
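For readers unfamiliar with the fidelity metric, here is a minimal sketch of a categorical Maximum Mean Discrepancy. It assumes a delta kernel, k(x, y) = 1 when the two codes match and 0 otherwise, for which the biased squared-MMD estimate reduces to the squared difference between the two empirical category-frequency vectors. The paper's actual kernel and estimator may differ; `categorical_mmd` is a hypothetical helper name.

```python
from collections import Counter
import math

def categorical_mmd(real_codes, synth_codes):
    """Biased MMD estimate between two samples of categorical codes (delta kernel)."""
    if not real_codes or not synth_codes:
        raise ValueError("both samples must be non-empty")
    p, q = Counter(real_codes), Counter(synth_codes)
    n, m = len(real_codes), len(synth_codes)
    categories = set(p) | set(q)
    # With a delta kernel, MMD^2 equals the sum over categories of (p_c - q_c)^2.
    mmd_sq = sum((p[c] / n - q[c] / m) ** 2 for c in categories)
    return math.sqrt(mmd_sq)

real = ["LAB_A", "LAB_A", "MED_B", "AE_C"]
synth = ["LAB_A", "MED_B", "MED_B", "AE_C"]
gap = categorical_mmd(real, synth)
```

A relative reduction such as the reported 40% would then be computed as `1 - mmd_guided / mmd_unguided` on the same real cohort.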

🛡️ Threat Analysis

Membership Inference Attack

The paper's security claim is explicitly framed as resistance to membership inference: the DOMIAS black-box attacker achieves AUROC ≈0.53 (vs ≈0.61 for non-KG baselines), and the KG-guided noise schedule is specifically credited with improving MIA resistance. The 0.55 AUROC safe-release threshold is the key benchmark used to validate the defense.
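To make the release criterion concrete, the sketch below computes the AUROC of an attacker's membership scores and compares it with the 0.55 threshold. It does not implement DOMIAS itself: it assumes the attack scores (higher meaning "more likely a training member") have already been produced by some attacker. The function names `auroc` and `passes_release_check` are illustrative.

```python
import numpy as np

def auroc(member_scores, nonmember_scores):
    """Probability that a random member outscores a random non-member (ties count half)."""
    pos = np.asarray(member_scores, dtype=float)[:, None]
    neg = np.asarray(nonmember_scores, dtype=float)[None, :]
    wins = (pos > neg).sum() + 0.5 * (pos == neg).sum()
    return float(wins / (pos.size * neg.size))

def passes_release_check(member_scores, nonmember_scores, threshold=0.55):
    """True when the attacker's AUROC stays below the chosen safe-release threshold."""
    return auroc(member_scores, nonmember_scores) < threshold

# Toy scores only; these are not results from the paper.
members = [0.2, 0.4, 0.6, 0.8]
nonmembers = [0.2, 0.4, 0.6, 0.8]
ok_to_release = passes_release_check(members, nonmembers)
```

An AUROC of 0.5 corresponds to an attacker that does no better than chance, which is why values close to 0.5 are read as strong membership privacy.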


Details

Domains
generative, graph
Model Types
diffusion, gnn
Threat Tags
black_box, training_time
Datasets
Orphanet/Orphadata, Human Phenotype Ontology (HPO), GARD rare-disease KG, PrimeKG, FAERS, simulated ultra-rare-disease cohorts
Applications
electronic health records, synthetic patient data generation, rare disease research