How Private Are DNA Embeddings? Inverting Foundation Model Representations of Genomic Sequences
Sofiane Ouaari, Jules Kreuer, Nico Pfeifer
Published on arXiv (2603.06950)
Model Inversion Attack
OWASP ML Top 10 — ML03
Key Finding
Per-token embeddings enable near-perfect DNA sequence reconstruction across all models; Evo 2 and NTv2 mean-pooled embeddings yield >90% reconstruction similarity for shorter sequences, while DNABERT-2's BPE tokenization provides the strongest privacy protection.
Training-based DNA embedding inversion (decoder network)
Novel technique introduced
DNA foundation models have become transformative tools in bioinformatics and healthcare applications. Trained on vast genomic datasets, these models can be used to generate sequence embeddings: dense vector representations that capture complex genomic information. These embeddings are increasingly shared via Embeddings-as-a-Service (EaaS) frameworks to facilitate downstream tasks, while supposedly protecting the privacy of the underlying raw sequences. As this practice becomes more prevalent, however, the security of these representations is being called into question. This study evaluates the resilience of DNA foundation models to model inversion attacks, in which adversaries attempt to reconstruct sensitive data from model outputs. In our setting, the attacker observes a zero-shot embedding produced by the model and feeds it to a trained decoder to reconstruct the DNA sequence. We evaluated the privacy of three DNA foundation models: DNABERT-2, Evo 2, and Nucleotide Transformer v2 (NTv2). Our results show that per-token embeddings allow near-perfect sequence reconstruction across all models. For mean-pooled embeddings, reconstruction quality degrades as sequence length increases, though it remains substantially above random baselines. Evo 2 and NTv2 prove the most vulnerable, with reconstruction similarities above 90% for shorter sequences, while DNABERT-2's BPE tokenization provides the greatest resilience. We found the correlation between embedding similarity and sequence similarity to be a key predictor of reconstruction success. Our findings emphasize the urgent need for privacy-aware design in genomic foundation models prior to their widespread deployment in EaaS settings. Training code, model weights, and the evaluation pipeline are released at: https://github.com/not-a-feature/DNA-Embedding-Inversion.
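The gap between per-token and mean-pooled leakage can be illustrated with a toy embedder (a one-hot-per-nucleotide stand-in, not any of the paper's models): per-token embeddings preserve the identity of every position and are trivially invertible, while mean pooling collapses the sequence to its nucleotide composition, discarding order.

```python
import random

# Toy embedder: each nucleotide maps to a fixed one-hot vector — a
# stand-in for a real DNA foundation model's per-token embeddings.
NUC = "ACGT"
EMB = {n: [1.0 if j == i else 0.0 for j in range(4)] for i, n in enumerate(NUC)}

def per_token_embed(seq):
    return [EMB[n] for n in seq]

def mean_pool(token_embs):
    L = len(token_embs)
    return [sum(v[j] for v in token_embs) / L for j in range(4)]

def invert_per_token(token_embs):
    # Each token embedding identifies its nucleotide exactly.
    return "".join(NUC[max(range(4), key=lambda j: v[j])] for v in token_embs)

seq = "".join(random.choice(NUC) for _ in range(64))
tokens = per_token_embed(seq)
assert invert_per_token(tokens) == seq  # perfect reconstruction

# Mean pooling keeps only nucleotide composition: any permutation of
# seq yields the identical pooled embedding, so order is unrecoverable.
shuffled = "".join(random.sample(seq, len(seq)))
assert mean_pool(per_token_embed(shuffled)) == mean_pool(tokens)
```

Real per-token embeddings are not one-hot, of course, but the paper's finding is analogous: they carry enough per-position information for a decoder to achieve near-perfect reconstruction, whereas mean pooling degrades (without eliminating) invertibility.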
Key Contributions
- First systematic evaluation of model inversion attacks against DNA foundation models (DNABERT-2, Evo 2, NTv2) in an Embeddings-as-a-Service deployment scenario
- Shows that per-token embeddings allow near-perfect DNA sequence reconstruction across all three models, while mean-pooled embeddings still achieve >90% similarity for shorter sequences in Evo 2 and NTv2
- Identifies BPE tokenization (DNABERT-2) as providing the greatest resilience, and establishes embedding-sequence similarity correlation as a key predictor of reconstruction success
🛡️ Threat Analysis
The core contribution is demonstrating training-based model inversion (embedding inversion) attacks — an adversary trains a decoder network to reconstruct genomic sequences from per-token or mean-pooled embeddings shared via EaaS. The adversary test is satisfied: a concrete attacker recovers private input data (DNA sequences) from model outputs (embeddings), which is the defining characteristic of ML03.
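The two-phase attack shape can be sketched end to end. The paper trains a neural decoder on (embedding, sequence) pairs; the simplified sketch below substitutes a 1-nearest-neighbour lookup for the decoder and a 2-mer-frequency profile for the black-box EaaS endpoint — both purely illustrative stand-ins — but the threat structure (query to build training pairs, then invert a leaked embedding) is the same.

```python
import random

NUC = "ACGT"

def eaas_embed(seq):
    # Stand-in for a black-box EaaS endpoint returning a mean-pooled
    # embedding (here: a 2-mer frequency profile, purely illustrative).
    kmers = [seq[i:i + 2] for i in range(len(seq) - 1)]
    vocab = [a + b for a in NUC for b in NUC]
    return [kmers.count(k) / len(kmers) for k in vocab]

def l2(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5

# Attacker phase 1: query the service on self-chosen sequences to build
# an (embedding -> sequence) training set. The paper fits a decoder
# network on such pairs; this sketch stores them for nearest-neighbour
# lookup instead.
random.seed(1)
reference = ["".join(random.choice(NUC) for _ in range(20)) for _ in range(500)]
index = [(eaas_embed(s), s) for s in reference]

def invert(victim_embedding):
    # Attacker phase 2: reconstruct a sequence from a leaked embedding.
    return min(index, key=lambda pair: l2(pair[0], victim_embedding))[1]

# Best case for the attacker: the victim sequence was covered in phase 1.
victim = reference[0]
assert invert(eaas_embed(victim)) == victim
```

A trained decoder generalises beyond the queried set where a lookup table cannot, which is why the paper's attack succeeds even on unseen sequences; the sketch only conveys the EaaS threat model in which an embedding alone suffices to recover its source.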