Learning the Signature of Memorization in Autoregressive Language Models
David Ilić, Kostadin Cvejoski, David Stanojević, Evgeny Grigorenko
Published on arXiv
2604.03199
Membership Inference Attack
OWASP ML Top 10 — ML04
Key Finding
Achieves 0.963 AUC on Mamba, 0.972 on RWKV-4, 0.936 on RecurrentGemma via zero-shot transfer from transformer-only training, exceeding held-out transformer performance (0.908 AUC)
LT-MIA (Learned Transfer MIA)
Novel technique introduced
All prior membership inference attacks for fine-tuned language models use hand-crafted heuristics (e.g., loss thresholding, Min-K%, reference calibration), each bounded by the designer's intuition. We introduce the first transferable learned attack, enabled by the observation that fine-tuning any model on any corpus yields unlimited labeled data, since membership is known by construction. This removes the shadow-model bottleneck and brings membership inference into the deep learning era: learning what matters rather than designing it, with generalization through training diversity and scale. We discover that fine-tuning language models produces an invariant signature of memorization detectable across architectural families and data domains. We train a membership inference classifier exclusively on transformer-based models. It transfers zero-shot to Mamba (state-space), RWKV-4 (linear attention), and RecurrentGemma (gated recurrence), achieving 0.963, 0.972, and 0.936 AUC respectively. Each evaluation combines an architecture and a dataset never seen during training, yet all three exceed performance on held-out transformers (0.908 AUC). These four families share no computational mechanisms; their only commonality is gradient descent on cross-entropy loss. Even simple likelihood-based methods exhibit strong transfer, confirming that the signature exists independently of the detection method. Our method, Learned Transfer MIA (LT-MIA), captures this signal most effectively by reframing membership inference as sequence classification over per-token distributional statistics. On transformers, LT-MIA achieves 2.8× higher TPR at 0.1% FPR than the strongest baseline. The method also transfers to code (0.865 AUC) despite training only on natural-language text. Code and trained classifier are available at https://github.com/JetBrains-Research/learned-mia.
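The hand-crafted likelihood heuristics the abstract contrasts against can be sketched roughly as follows. This is a minimal illustration of loss thresholding and the Min-K% idea over per-token negative log-likelihoods; the function names, the `k` default, and the sample values are assumptions for illustration, not the paper's implementation:

```python
import numpy as np

def loss_threshold_score(token_nlls):
    """Loss-thresholding MIA: a lower average negative log-likelihood
    suggests the sequence was seen in training (member). Higher score
    therefore means 'more likely a member'."""
    return -float(np.mean(token_nlls))

def min_k_percent_score(token_nlls, k=0.2):
    """Min-K%-style heuristic: average only the k fraction of tokens
    with the highest NLL. Members tend to have fewer 'surprising'
    tokens, so even their worst tokens score low NLL."""
    nlls = np.sort(np.asarray(token_nlls, dtype=float))[::-1]  # worst first
    n = max(1, int(len(nlls) * k))
    return -float(np.mean(nlls[:n]))

# Hypothetical per-token NLLs, not values from the paper.
member = [0.4, 0.3, 0.5, 0.2, 0.6]
non_member = [1.8, 2.1, 0.9, 2.5, 1.4]
assert loss_threshold_score(member) > loss_threshold_score(non_member)
assert min_k_percent_score(member) > min_k_percent_score(non_member)
```

Both heuristics reduce a sequence to a single scalar, which is exactly the bottleneck the paper's learned approach removes by keeping the full per-token statistics.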
Key Contributions
- First transferable learned MIA for language models, removing shadow model bottleneck by treating fine-tuning as unlimited labeled data source
- Discovers invariant memorization signature across architectures (transformers, Mamba, RWKV, RecurrentGemma) with zero-shot transfer achieving 0.936-0.972 AUC
- Reframes MIA as sequence classification over per-token distributional statistics, achieving 2.8× higher TPR at 0.1% FPR than strongest baseline
🛡️ Threat Analysis
The core contribution is a membership inference attack (MIA) that determines whether specific texts were in a model's training data. The paper trains a classifier to detect membership by learning distributional signatures of memorization across fine-tuned language models.
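As a sketch of the "sequence classification over per-token distributional statistics" reframing: for each position, the model's next-token distribution is summarized into a small feature vector, and the resulting feature sequence is fed to a classifier. The specific feature set below (NLL, entropy, rank of the true token) and the function name are illustrative assumptions, not the paper's exact design:

```python
import numpy as np

def per_token_features(probs_per_step, token_ids):
    """Summarize each next-token distribution into distributional
    statistics for the observed token: its NLL, the distribution's
    entropy, and the token's rank (0 = model's top prediction).
    Feature set is illustrative, not the paper's exact choice."""
    feats = []
    for probs, tok in zip(probs_per_step, token_ids):
        p = np.asarray(probs, dtype=float)
        nll = -np.log(p[tok] + 1e-12)              # surprise of true token
        entropy = -np.sum(p * np.log(p + 1e-12))   # distribution sharpness
        rank = int(np.sum(p > p[tok]))             # tokens ranked above it
        feats.append([nll, entropy, rank])
    return np.array(feats)  # shape (seq_len, 3): input to a sequence classifier

# Toy 3-token vocabulary, 2 positions (hypothetical probabilities).
feats = per_token_features([[0.7, 0.2, 0.1], [0.1, 0.6, 0.3]], [0, 2])
```

Because these statistics are model-agnostic (any autoregressive LM emits a next-token distribution trained with cross-entropy), a classifier over them can plausibly transfer across architectures, which is the paper's central observation.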