Personal Information Parroting in Language Models

Nishant Subramani 1, Kshitish Ghate 2, Mona Diab 1

0 citations · 20 references · arXiv (Cornell University)

Published on arXiv · 2602.20580

Model Inversion Attack

OWASP ML Top 10 — ML03

Sensitive Information Disclosure

OWASP LLM Top 10 — LLM06

Key Finding

Pythia-6.9B reproduces 13.6% of personal information instances verbatim when prompted with their preceding context; even Pythia-160M leaks 2.7%, with memorization increasing monotonically with both model scale and training duration.

R&R (Regexes and Rules) detector

Novel technique introduced


Modern language models (LMs) are trained on large scrapes of the Web containing millions of personal information (PI) instances, many of which LMs memorize, increasing privacy risks. In this work, we develop the regexes and rules (R&R) detector suite to detect email addresses, phone numbers, and IP addresses, which outperforms the best regex-based PI detectors. On a manually curated set of 483 instances of PI, we measure memorization: we find that 13.6% are parroted verbatim by the Pythia-6.9B model, i.e., when the model is prompted with the tokens that precede the PI in the original document, greedy decoding generates the entire PI span exactly. We expand this analysis to study models of varying sizes (160M–6.9B) and pretraining time steps (70k–143k iterations) in the Pythia model suite and find that both model size and amount of pretraining are positively correlated with memorization. Even the smallest model, Pythia-160M, parrots 2.7% of the instances exactly. Consequently, we strongly recommend that pretraining datasets be aggressively filtered and anonymized to minimize PI parroting.


Key Contributions

  • R&R detector suite (regexes and rules) for identifying emails, phone numbers, and IP addresses that outperforms prior regex-based PI detectors
  • Empirical measurement of verbatim PI parroting on 483 manually curated instances, finding 13.6% exact reproduction rate in Pythia-6.9B via greedy decoding from preceding context
  • Scaling analysis across Pythia model sizes (160M–6.9B) and pretraining steps (70k–143k), demonstrating both positively correlate with PI memorization
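The paper's actual R&R patterns and rules are not reproduced here, but the regexes-and-rules idea can be sketched: regexes propose candidate spans, then lightweight rules reject false positives that plain regexes over-match (the patterns and the IP-octet rule below are illustrative assumptions, not the authors' implementation).

```python
import re

# Illustrative patterns, not the paper's actual R&R detectors.
EMAIL_RE = re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b")
PHONE_RE = re.compile(r"(?:\+?\d{1,3}[ .-]?)?(?:\(\d{3}\)|\d{3})[ .-]?\d{3}[ .-]?\d{4}\b")
IP_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def valid_ip(span: str) -> bool:
    # Rule layer: a bare regex accepts octets like 300 or 999;
    # require every octet to be in 0-255.
    return all(0 <= int(octet) <= 255 for octet in span.split("."))


def detect_pi(text: str) -> dict:
    """Return candidate PI spans per category, after rule filtering."""
    return {
        "email": EMAIL_RE.findall(text),
        "phone": PHONE_RE.findall(text),
        "ip": [s for s in IP_RE.findall(text) if valid_ip(s)],
    }
```

The rule layer is what distinguishes this from a pure-regex detector: `300.1.2.999` matches the IP regex but is discarded by `valid_ip`.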

🛡️ Threat Analysis

Model Inversion Attack

The core contribution is measuring training data extraction: when prompted with the tokens preceding a PI span in the original pretraining document, the model reproduces the personal information verbatim — a direct instantiation of training data reconstruction from a language model.
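The parroting criterion above can be expressed as a small harness: greedy-decode from the preceding context for exactly as many tokens as the PI span, then check for an exact token-level match. The function names and the `generate_fn` interface below are hypothetical; in practice `generate_fn` would wrap a Pythia checkpoint via Hugging Face `model.generate(..., do_sample=False)`.

```python
from typing import Callable, List, Sequence, Tuple

# generate_fn(prefix_ids, max_new_tokens) -> list of newly generated token ids
GenerateFn = Callable[[List[int], int], List[int]]


def parrots_verbatim(generate_fn: GenerateFn,
                     prefix_ids: List[int],
                     pi_ids: List[int]) -> bool:
    """True iff greedy decoding from the preceding context reproduces
    the entire PI token span exactly (the paper's parroting criterion)."""
    out = generate_fn(prefix_ids, len(pi_ids))
    return out[:len(pi_ids)] == pi_ids


def parrot_rate(generate_fn: GenerateFn,
                instances: Sequence[Tuple[List[int], List[int]]]) -> float:
    """Fraction of (prefix, PI span) instances parroted verbatim,
    e.g. 0.136 for Pythia-6.9B on the paper's 483-instance set."""
    hits = sum(parrots_verbatim(generate_fn, pre, pi) for pre, pi in instances)
    return hits / len(instances)
```

With real token ids from the curated PI set, `parrot_rate` directly yields the 13.6% / 2.7% figures reported for Pythia-6.9B and Pythia-160M.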


Details

Domains
nlp
Model Types
llm · transformer
Threat Tags
inference_time · black_box · training_time
Datasets
The Pile · Pythia model suite
Applications
language model pretraining · pii leakage measurement