
Prior Aware Memorization: An Efficient Metric for Distinguishing Memorization from Generalization in Large Language Models

Trishita Tiwari 1, Ari Trachtenberg 2, G. Edward Suh 3,1

0 citations · 28 references · arXiv (Cornell University)


Published on arXiv

2602.18733

Model Inversion Attack (OWASP ML Top 10 — ML03)

Sensitive Information Disclosure (OWASP LLM Top 10 — LLM06)

Key Finding

55–90% of sequences labeled as memorized by existing metrics are statistically common sequences arising from generalization, not genuine training data memorization.

Prior-Aware Memorization (PAM)

Novel technique introduced


Training data leakage from Large Language Models (LLMs) raises serious concerns related to privacy, security, and copyright compliance. A central challenge in assessing this risk is distinguishing genuine memorization of training data from the generation of statistically common sequences. Existing approaches to measuring memorization often conflate these phenomena, labeling outputs as memorized even when they arise from generalization over common patterns. Counterfactual Memorization provides a principled solution by comparing models trained with and without a target sequence, but its reliance on retraining multiple baseline models makes it computationally expensive and impractical at scale. This work introduces Prior-Aware Memorization, a theoretically grounded, lightweight and training-free criterion for identifying genuine memorization in LLMs. The key idea is to evaluate whether a candidate suffix is strongly associated with its specific training prefix or whether it appears with high probability across many unrelated prompts due to statistical commonality. We evaluate this metric on text from the training corpora of two pre-trained models, LLaMA and OPT, using both long sequences (to simulate copyright risks) and named entities (to simulate PII leakage). Our results show that between 55% and 90% of sequences previously labeled as memorized are in fact statistically common. Similar findings hold for the SATML training data extraction challenge dataset, where roughly 40% of sequences exhibit common-pattern behavior despite appearing only once in the training data. These results demonstrate that low frequency alone is insufficient evidence of memorization and highlight the importance of accounting for model priors when assessing leakage.
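The core comparison described above — whether a suffix is strongly associated with its specific training prefix, or likely under many unrelated prompts — can be sketched in code. This summary does not give the paper's exact formulation, so the sketch below makes assumptions: it aggregates the prior via a log-mean over a small set of reference prompts, and `toy_logprob` is a hypothetical stand-in for a real language model's conditional log-probability.

```python
import math

def pam_score(logprob, target_prefix, suffix, reference_prefixes):
    # Log-probability of the suffix conditioned on its original training prefix.
    target_lp = logprob(target_prefix, suffix)
    # Approximate the model's prior on the suffix by averaging (in probability
    # space) its likelihood under unrelated prompts, then taking the log.
    # The paper's exact aggregation is not given here; this is one plausible choice.
    prior_lp = math.log(
        sum(math.exp(logprob(p, suffix)) for p in reference_prefixes)
        / len(reference_prefixes)
    )
    # Large score: suffix is tied to its prefix (memorization candidate).
    # Score near zero: suffix is likely regardless of prompt (common pattern).
    return target_lp - prior_lp

# Hypothetical stand-in for an LM's conditional log-probability.
def toy_logprob(prefix, suffix):
    table = {
        ("secret-key:", "X9QJ2"): -0.1,          # strongly tied to one prefix
        ("the weather is", "nice today"): -0.5,   # common continuation
        ("my favorite food is", "nice today"): -0.7,
    }
    return table.get((prefix, suffix), -15.0)     # rare elsewhere

refs = ["the weather is", "my favorite food is"]
memorized = pam_score(toy_logprob, "secret-key:", "X9QJ2", refs)
common = pam_score(toy_logprob, "the weather is", "nice today", refs)
```

Under this toy model, the prefix-bound suffix gets a large PAM score while the statistically common suffix scores near zero — mirroring the paper's point that high extraction probability alone does not imply memorization.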


Key Contributions

  • Introduces Prior-Aware Memorization (PAM), a training-free metric that distinguishes genuine LLM memorization from statistically common sequence generation by comparing suffix probability under the target prefix against its marginal probability across arbitrary prompts
  • Demonstrates empirically that 55–90% of sequences previously labeled as memorized in LLaMA and OPT are actually statistically common, not genuine training data leakage
  • Shows that low training-set frequency alone is insufficient evidence of memorization, with ~40% of single-occurrence sequences in the SATML challenge exhibiting common-pattern behavior

🛡️ Threat Analysis

Model Inversion Attack

Paper directly targets training data memorization/extraction from LLMs — measuring how much private training data (PII, copyrighted text) can be recovered from model outputs. Uses the adversarial SATML data extraction challenge as an evaluation benchmark, confirming the adversary-in-the-threat-model requirement.


Details

Domains
nlp
Model Types
llm
Threat Tags
inference_time, black_box
Datasets
LLaMA training corpus, OPT training corpus, SATML training data extraction challenge
Applications
llm training data extraction, pii leakage assessment, copyright compliance auditing