Quantifying Memorization and Privacy Risks in Genomic Language Models
Alexander Nemecek 1, Wenbiao Li 1, Xiaoqian Jiang 2, Jaideep Vaidya 3, Erman Ayday 1
Published on arXiv
2603.08913
Model Inversion Attack
OWASP ML Top 10 — ML03
Membership Inference Attack
OWASP ML Top 10 — ML04
Key Finding
GLMs exhibit measurable memorization that varies across architectures and training regimes, and no single attack vector (perplexity, canary extraction, or MIA) alone captures the full scope of memorization risk.
Genomic language models (GLMs) have emerged as powerful tools for learning representations of DNA sequences, enabling advances in variant prediction, regulatory element identification, and cross-task transfer learning. However, as these models are increasingly trained or fine-tuned on sensitive genomic cohorts, they risk memorizing specific sequences from their training data, raising serious concerns around privacy, data leakage, and regulatory compliance. Despite growing awareness of memorization risks in general-purpose language models, little systematic evaluation exists for these risks in the genomic domain, where data exhibit unique properties such as a fixed nucleotide alphabet, strong biological structure, and individual identifiability.

We present a comprehensive, multi-vector privacy evaluation framework designed to quantify memorization risks in GLMs. Our approach integrates three complementary risk assessment methodologies: perplexity-based detection, canary sequence extraction, and membership inference. These are combined into a unified evaluation pipeline that produces a worst-case memorization risk score. To enable controlled evaluation, we plant canary sequences at varying repetition rates into both synthetic and real genomic datasets, allowing precise quantification of how repetition and training dynamics influence memorization.

We evaluate our framework across multiple GLM architectures, examining the relationship between sequence repetition, model capacity, and memorization risk. Our results establish that GLMs exhibit measurable memorization and that the degree of memorization varies across architectures and training regimes. These findings reveal that no single attack vector captures the full scope of memorization risk, underscoring the need for multi-vector privacy auditing as a standard practice for genomic AI systems.
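The perplexity-based vector in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: a toy unigram distribution over nucleotides stands in for the GLM, and the threshold and function names are hypothetical. The idea is that a model assigns suspiciously low perplexity to sequences it has memorized.

```python
import math

def sequence_perplexity(seq, log_prob):
    """Perplexity of a nucleotide sequence under a model's
    per-token conditional log-probability function (natural log)."""
    lps = [log_prob(seq[:i], seq[i]) for i in range(len(seq))]
    return math.exp(-sum(lps) / len(lps))

# Stand-in "model": a fixed unigram distribution over A/C/G/T.
# A real audit would query the GLM's conditional token probabilities.
UNIGRAM = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}

def unigram_log_prob(prefix, token):
    return math.log(UNIGRAM[token])

def flag_low_perplexity(sequences, log_prob, threshold):
    """Flag sequences whose perplexity falls below a threshold,
    i.e. sequences the model is suspiciously confident about."""
    return [s for s in sequences
            if sequence_perplexity(s, log_prob) < threshold]
```

In practice the threshold would be calibrated against held-out sequences that are known not to be in the training set, rather than chosen by hand.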
Key Contributions
- Unified multi-vector privacy evaluation pipeline combining perplexity-based detection, canary sequence extraction, and membership inference into a worst-case memorization risk score for genomic language models
- Controlled canary planting methodology at varying repetition rates in both synthetic and real genomic datasets, enabling precise quantification of how repetition and training dynamics drive memorization
- Cross-architecture empirical evaluation demonstrating that memorization risk in GLMs varies significantly across architectures and training regimes, and that no single attack vector captures the full risk
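The canary-planting methodology in the second contribution can be sketched roughly as below. This is an illustrative reconstruction under stated assumptions, not the authors' code: canary length, the seed, and function names are my own, and a real pipeline would plant canaries before tokenization and training.

```python
import random

NUCLEOTIDES = "ACGT"

def make_canary(length, rng):
    """Sample a random nucleotide canary; randomness makes accidental
    overlap with real genomic sequences vanishingly unlikely."""
    return "".join(rng.choice(NUCLEOTIDES) for _ in range(length))

def plant_canaries(dataset, repetition_rates, length=50, seed=0):
    """Insert one fresh canary per repetition rate, each repeated
    `rate` times, then shuffle so canaries are not clustered.
    Returns the augmented dataset and a canary -> rate map, so
    memorization can later be measured as a function of repetition."""
    rng = random.Random(seed)
    canaries = {make_canary(length, rng): rate for rate in repetition_rates}
    augmented = list(dataset)
    for canary, rate in canaries.items():
        augmented.extend([canary] * rate)
    rng.shuffle(augmented)
    return augmented, canaries
```

Varying the repetition rate (e.g. 1, 4, 16 copies) is what lets the framework quantify how repetition drives memorization.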
🛡️ Threat Analysis
Canary sequence extraction is a core attack vector in the framework: an adversary reconstructs memorized training sequences from GLM outputs, a form of training data reconstruction. Perplexity-based detection likewise probes whether specific sequences were memorized and remain recoverable from the model.
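One standard way to score canary extraction risk, which the paper's framework plausibly builds on, is the exposure metric from Carlini et al.'s "secret sharer" work: rank the planted canary against a space of candidate sequences by model perplexity; a canary that ranks far ahead of chance has been memorized. The sketch below assumes perplexities have already been computed; it is not the paper's exact scoring code.

```python
import math

def exposure(canary_ppl, candidate_ppls):
    """Secret-sharer-style exposure: log2 of the candidate-space size
    minus log2 of the canary's rank when all sequences are sorted by
    perplexity (rank 1 = lowest perplexity = most memorized).
    Higher exposure means stronger evidence of memorization."""
    ranked = sorted(candidate_ppls + [canary_ppl])
    rank = ranked.index(canary_ppl) + 1
    n = len(candidate_ppls) + 1
    return math.log2(n) - math.log2(rank)
```

A canary ranked first among all candidates attains the maximum exposure log2(n); one ranked last attains exposure 0, i.e. no better than chance.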
Membership inference is one of the three primary attack vectors in the evaluation framework: it determines whether specific genomic sequences were present in the training set, the canonical ML04 threat.
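A minimal membership inference baseline, in the spirit of Yeom et al.'s loss-threshold attack, is sketched below. This is a generic illustration, not the paper's attack: it assumes per-sequence losses are already available and predicts "member" whenever the loss falls below a threshold, then measures attack success as membership advantage (TPR minus FPR).

```python
def loss_threshold_mia(losses, threshold):
    """Predict 'member' (True) when a sequence's loss is below the
    threshold, i.e. the model fits it unusually well."""
    return [loss < threshold for loss in losses]

def membership_advantage(member_losses, nonmember_losses, threshold):
    """Membership advantage = TPR - FPR. 0 means the attack is no
    better than chance; 1 means members and non-members separate
    perfectly, indicating severe memorization."""
    tpr = sum(l < threshold for l in member_losses) / len(member_losses)
    fpr = sum(l < threshold for l in nonmember_losses) / len(nonmember_losses)
    return tpr - fpr
```

Combining this score with the perplexity and canary-extraction vectors, e.g. by taking the worst case across the three, matches the unified risk score the paper describes.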