Defense · 2025

Data Cartography for Detecting Memorization Hotspots and Guiding Data Interventions in Generative Models

Laksh Patel, Neel Shanbhag



Published on arXiv: 2509.00083

Model Inversion Attack (OWASP ML Top 10 — ML03)

Sensitive Information Disclosure (OWASP LLM Top 10 — LLM06)

Key Finding

Reduces synthetic canary extraction success by over 40% with only 10% data pruning while increasing validation perplexity by less than 0.5%.

GenDataCarto

Novel technique introduced


Modern generative models risk overfitting and unintentionally memorizing rare training examples, which can be extracted by adversaries or inflate benchmark performance. We propose Generative Data Cartography (GenDataCarto), a data-centric framework that assigns each pretraining sample a difficulty score (early-epoch loss) and a memorization score (frequency of "forget events"), then partitions examples into four quadrants to guide targeted pruning and up-/down-weighting. We prove that our memorization score lower-bounds classical influence under smoothness assumptions and that down-weighting high-memorization hotspots provably decreases the generalization gap via uniform stability bounds. Empirically, GenDataCarto reduces synthetic canary extraction success by over 40% at just 10% data pruning, while increasing validation perplexity by less than 0.5%. These results demonstrate that principled data interventions can dramatically mitigate leakage with minimal cost to generative performance.
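The scoring and quadrant partition described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes per-sample losses are logged once per epoch, operationalizes a "forget event" as an epoch-to-epoch loss increase, and splits quadrants at the median of each score — all of which are assumptions for the sketch.

```python
import numpy as np

def gen_data_carto(loss_matrix, early_epochs=3, forget_eps=0.0):
    """Assign each sample a difficulty score, a memorization score,
    and a quadrant label (0-3).

    loss_matrix: (num_epochs, num_samples) array of per-sample losses
    logged during pretraining.
    """
    # Difficulty score: mean loss over the first few ("early") epochs.
    difficulty = loss_matrix[:early_epochs].mean(axis=0)

    # Memorization score: count of forget events, here taken to be
    # epochs where a sample's loss rises again after having fallen
    # (hypothetical operationalization of forget-event frequency).
    deltas = np.diff(loss_matrix, axis=0)
    forget_events = (deltas > forget_eps).sum(axis=0)

    # Partition into four quadrants by the median of each score;
    # high-memorization quadrants are candidates for pruning or
    # down-weighting, low-difficulty/low-memorization for up-weighting.
    d_med, m_med = np.median(difficulty), np.median(forget_events)
    quadrant = 2 * (difficulty > d_med) + (forget_events > m_med)
    return difficulty, forget_events, quadrant
```

In practice the per-epoch loss logging adds negligible cost, since the losses are already computed in the forward pass; only the (epochs × samples) buffer is extra.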


Key Contributions

  • GenDataCarto framework that assigns each pretraining sample a difficulty score (early-epoch loss) and memorization score (forget event frequency) to guide targeted pruning and reweighting
  • Theoretical proof that the memorization score lower-bounds classical influence functions under smoothness/convexity assumptions, with a uniform stability bound showing down-weighting hotspots reduces the generalization gap
  • Empirical demonstration of >40% reduction in synthetic canary extraction success at 10% data pruning with <0.5% perplexity increase on LSTM and GPT-2
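The stability argument in the second contribution builds on the standard uniform-stability result (Bousquet & Elisseeff, 2002), recalled here for context; the paper's specific bound for down-weighted hotspots is not reproduced:

```latex
% If algorithm A is \beta-uniformly stable, its expected
% generalization gap over training sets S is bounded by \beta:
\mathbb{E}_S\!\left[\, R(A_S) - \hat{R}_S(A_S) \,\right] \;\le\; \beta
```

Down-weighting high-memorization samples shrinks the effective sensitivity of the learned model to any single example, which tightens the achievable β and hence the gap.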

🛡️ Threat Analysis

Model Inversion Attack

The primary contribution defends against adversarial extraction of memorized training data from generative models — the canonical ML03 threat. The key evaluation metric is synthetic canary extraction success rate, directly measuring an adversary's ability to reconstruct training samples from model outputs. The framework (GenDataCarto) reduces this extraction success by >40% through data-centric interventions at training time.
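The headline metric — synthetic canary extraction success — can be computed along the following lines. This is a hypothetical sketch in the spirit of canary-exposure tests: a planted canary counts as "extracted" when the model scores it above every same-format reference string. The paper's exact extraction protocol may differ.

```python
def canary_extraction_rate(model_logprob, canaries, references):
    """Fraction of planted canaries an adversary would recover.

    model_logprob: callable mapping a string to its log-probability
        under the trained model.
    canaries: list of planted secret strings.
    references: per-canary lists of decoy strings of the same format.
    """
    extracted = sum(
        all(model_logprob(c) > model_logprob(r) for r in refs)
        for c, refs in zip(canaries, references)
    )
    return extracted / len(canaries)
```

Measuring this rate before and after the GenDataCarto intervention (pruning 10% of samples) is what yields the reported >40% relative reduction.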


Details

Domains
nlp, generative
Model Types
llm, rnn
Threat Tags
training_time, black_box, white_box
Datasets
Wikitext-103
Applications
language model pretraining, generative model training, training data privacy