Prior Aware Memorization: An Efficient Metric for Distinguishing Memorization from Generalization in Large Language Models

Training data leakage from Large Language Models (LLMs) raises serious concerns related to privacy, security, and copyright compliance. A central challenge in assessing this risk is distinguishing genuine memorization of training data from the generation of statistically common sequences. Existing approaches to measuring memorization often conflate these phenomena, labeling outputs as memorized even when they arise from generalization over common patterns. Counterfactual Memorization provides a principled solution by comparing models trained with and without a target sequence, but its reliance on retraining multiple baseline models makes it computationally expensive and impractical at scale. This work introduces Prior-Aware Memorization, a theoretically grounded, lightweight and training-free criterion for identifying genuine memorization in LLMs. The key idea is to evaluate whether a candidate suffix is strongly associated with its specific training prefix or whether it appears with high probability across many unrelated prompts due to statistical commonality. We evaluate this metric on text from the training corpora of two pre-trained models, LLaMA and OPT, using both long sequences (to simulate copyright risks) and named entities (to simulate PII leakage). Our results show that between 55% and 90% of sequences previously labeled as memorized are in fact statistically common. Similar findings hold for the SATML training data extraction challenge dataset, where roughly 40% of sequences exhibit common-pattern behavior despite appearing only once in the training data. These results demonstrate that low frequency alone is insufficient evidence of memorization and highlight the importance of accounting for model priors when assessing leakage.

Key Contributions

Introduces Prior-Aware Memorization (PAM), a training-free metric that distinguishes genuine LLM memorization from statistically common sequence generation by comparing suffix probability under the target prefix against its marginal probability across arbitrary prompts
Demonstrates empirically that 55–90% of sequences previously labeled as memorized in LLaMA and OPT are actually statistically common, not genuine training data leakage
Shows that low training-set frequency alone is insufficient evidence of memorization, with ~40% of single-occurrence sequences in the SATML challenge exhibiting common-pattern behavior
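
The comparison described in the first contribution can be sketched as follows. This is a minimal illustration, not the paper's exact formulation: it assumes the suffix log-probabilities have already been obtained from the model, once under the true training prefix and once under each of several unrelated prompts, and it approximates the marginal probability by the average over those prompts. The function names and the `min_gap` threshold are hypothetical choices made for this example.

```python
import math
from typing import Sequence


def log_mean_exp(log_values: Sequence[float]) -> float:
    """Numerically stable log of the mean of exp(log_values)."""
    if not log_values:
        raise ValueError("need at least one log-probability")
    m = max(log_values)
    total = sum(math.exp(v - m) for v in log_values)
    return m + math.log(total / len(log_values))


def prior_aware_gap(target_logprob: float,
                    unrelated_logprobs: Sequence[float]) -> float:
    """Log-ratio between the suffix probability under its training prefix
    and its estimated marginal probability over unrelated prompts.

    Assumption: the marginal is approximated by a plain average over the
    sampled prompts; the paper may estimate it differently.
    """
    return target_logprob - log_mean_exp(unrelated_logprobs)


def looks_memorized(target_logprob: float,
                    unrelated_logprobs: Sequence[float],
                    min_gap: float = math.log(100.0)) -> bool:
    """Flag a suffix only if the prefix makes it much more likely than the prior.

    min_gap is a hypothetical threshold (here: 100x more likely).
    """
    return prior_aware_gap(target_logprob, unrelated_logprobs) >= min_gap


# Suffix that is likely only after its own prefix: large gap.
specific = looks_memorized(math.log(0.9), [math.log(0.001), math.log(0.003)])

# Suffix that is likely after almost any prompt: small gap, statistically common.
common = looks_memorized(math.log(0.9), [math.log(0.5), math.log(0.7)])
```

Under these assumptions, a suffix that the model produces with high probability regardless of the prompt yields a gap near zero and is treated as statistically common, while a suffix that is probable only after its specific training prefix yields a large gap.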

🛡️ Threat Analysis

Model Inversion Attack

Paper directly targets training data memorization/extraction from LLMs — measuring how much private training data (PII, copyrighted text) can be recovered from model outputs. Uses the adversarial SATML data extraction challenge as an evaluation benchmark, confirming the adversary-in-the-threat-model requirement.

Details

Domains

nlp

Model Types

llm

Threat Tags

inference_time, black_box

Datasets

LLaMA training corpus, OPT training corpus, SATML training data extraction challenge

Applications

2026 0 cit.
