Prior Aware Memorization: An Efficient Metric for Distinguishing Memorization from Generalization in Large Language Models
Trishita Tiwari 1, Ari Trachtenberg 2, G. Edward Suh 3,1
Published on arXiv
2602.18733
Model Inversion Attack
OWASP ML Top 10 — ML03
Sensitive Information Disclosure
OWASP LLM Top 10 — LLM06
Key Finding
55–90% of sequences labeled as memorized by existing metrics are statistically common sequences arising from generalization, not genuine training data memorization.
Prior-Aware Memorization (PAM)
Novel technique introduced
Training data leakage from Large Language Models (LLMs) raises serious concerns related to privacy, security, and copyright compliance. A central challenge in assessing this risk is distinguishing genuine memorization of training data from the generation of statistically common sequences. Existing approaches to measuring memorization often conflate these phenomena, labeling outputs as memorized even when they arise from generalization over common patterns. Counterfactual Memorization provides a principled solution by comparing models trained with and without a target sequence, but its reliance on retraining multiple baseline models makes it computationally expensive and impractical at scale. This work introduces Prior-Aware Memorization, a theoretically grounded, lightweight and training-free criterion for identifying genuine memorization in LLMs. The key idea is to evaluate whether a candidate suffix is strongly associated with its specific training prefix or whether it appears with high probability across many unrelated prompts due to statistical commonality. We evaluate this metric on text from the training corpora of two pre-trained models, LLaMA and OPT, using both long sequences (to simulate copyright risks) and named entities (to simulate PII leakage). Our results show that between 55% and 90% of sequences previously labeled as memorized are in fact statistically common. Similar findings hold for the SATML training data extraction challenge dataset, where roughly 40% of sequences exhibit common-pattern behavior despite appearing only once in the training data. These results demonstrate that low frequency alone is insufficient evidence of memorization and highlight the importance of accounting for model priors when assessing leakage.
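The key idea in the abstract, comparing how likely a suffix is under its own training prefix with how likely it is under unrelated prompts, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's exact formulation: the function names `log_mean_exp` and `prior_aware_score` are hypothetical, the marginal probability is approximated by a plain average over a handful of control prompts, and the log-probabilities are assumed to come from whatever model is being audited.

```python
import math
from typing import Sequence


def log_mean_exp(log_values: Sequence[float]) -> float:
    """Numerically stable log of the mean of exp(log_values)."""
    m = max(log_values)
    return m + math.log(sum(math.exp(v - m) for v in log_values) / len(log_values))


def prior_aware_score(logp_given_target: float,
                      logps_given_controls: Sequence[float]) -> float:
    """Hypothetical score: log p(suffix | training prefix) minus the log of an
    estimated marginal p(suffix), where the marginal is approximated
    (an assumption of this sketch) by averaging over unrelated control prompts.

    A score near 0 means the suffix is about as likely after any prompt
    (statistically common); a large positive score means the suffix is tied
    to its specific training prefix.
    """
    return logp_given_target - log_mean_exp(logps_given_controls)


# Toy log-probabilities, invented purely for illustration.
common_suffix = prior_aware_score(math.log(0.9), [math.log(0.9)] * 3)
specific_suffix = prior_aware_score(math.log(0.9), [math.log(1e-6)] * 4)
print(common_suffix, specific_suffix)
```

In this toy setup both suffixes are equally likely after the training prefix, so a metric that looks only at that probability would treat them identically; only the comparison against the control prompts separates them.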
Key Contributions
- Introduces Prior-Aware Memorization (PAM), a training-free metric that distinguishes genuine LLM memorization from statistically common sequence generation by comparing suffix probability under the target prefix against its marginal probability across arbitrary prompts
- Demonstrates empirically that 55–90% of sequences previously labeled as memorized in LLaMA and OPT are actually statistically common, not genuine training data leakage
- Shows that low training-set frequency alone is insufficient evidence of memorization, with ~40% of single-occurrence sequences in the SATML challenge exhibiting common-pattern behavior
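The headline percentages in the contributions above are fractions of the form "of the sequences an existing metric flags as memorized, how many look statistically common once the model's prior is taken into account". The sketch below shows one way such a fraction could be computed. It is an illustration only: the function name, the dictionary keys, both thresholds, and all numbers are hypothetical and are not taken from the paper.

```python
import math
from typing import Dict, List


def fraction_statistically_common(candidates: List[Dict],
                                  extract_threshold: float,
                                  ratio_threshold: float) -> float:
    """Among candidates flagged by a prefix-only criterion
    (log p(suffix | training prefix) >= extract_threshold), return the share
    whose log-ratio against an estimated marginal falls below ratio_threshold.

    Assumption of this sketch: the marginal is the average probability of the
    suffix over a few unrelated control prompts.
    """
    flagged = [c for c in candidates if c["logp_target"] >= extract_threshold]
    if not flagged:
        return 0.0
    common = 0
    for c in flagged:
        controls = c["logp_controls"]
        m = max(controls)
        # Stable log-mean-exp over the control prompts.
        log_marginal = m + math.log(
            sum(math.exp(v - m) for v in controls) / len(controls)
        )
        if c["logp_target"] - log_marginal < ratio_threshold:
            common += 1
    return common / len(flagged)


# Toy candidates, invented for illustration only.
toy = [
    {"logp_target": math.log(0.8), "logp_controls": [math.log(0.7), math.log(0.75)]},
    {"logp_target": math.log(0.9), "logp_controls": [math.log(1e-5), math.log(1e-6)]},
    {"logp_target": math.log(0.6), "logp_controls": [math.log(0.5), math.log(0.55)]},
    {"logp_target": math.log(0.01), "logp_controls": [math.log(0.01), math.log(0.02)]},
]
share = fraction_statistically_common(
    toy, extract_threshold=math.log(0.5), ratio_threshold=1.0
)
print(share)
```

Three of the four toy candidates pass the prefix-only threshold, and two of those three are nearly as likely under the control prompts, so they would be counted as statistically common rather than memorized.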
🛡️ Threat Analysis
The paper directly targets training data memorization and extraction in LLMs: it measures how much private training data (PII, copyrighted text) can be recovered from model outputs. It evaluates on the SATML training data extraction challenge, an adversarial benchmark, so the threat model includes an explicit adversary.