Randomized Masked Finetuning: An Efficient Way to Mitigate Memorization of PIIs in LLMs
Published on arXiv: 2512.03310
Model Inversion Attack (OWASP ML Top 10 — ML03)
Sensitive Information Disclosure (OWASP LLM Top 10 — LLM06)
Key Finding
RMFT achieves 80.81% reduction in Total Extraction Rate and 80.17% reduction in Seen Extraction Rate versus baseline fine-tuning, with only a 5.73% perplexity increase on GPT-2 XL.
Randomized Masked Fine-Tuning (RMFT)
Novel technique introduced
Memorization in natural language models, especially Large Language Models (LLMs), poses severe security and privacy risks, as models tend to memorize personally identifiable information (PII) from their training data. We introduce Randomized Masked Fine-Tuning (RMFT), a novel privacy-preserving fine-tuning technique that reduces PII memorization while minimizing performance impact. Using the Enron Email Dataset, we demonstrate that RMFT achieves an 80.81% reduction in Total Extraction Rate and an 80.17% reduction in Seen Extraction Rate compared to baseline fine-tuning, outperforming deduplication methods while incurring only a 5.73% increase in perplexity. We also present MaxTER, a Pareto-optimal evaluation framework for assessing privacy-utility tradeoffs, and use its Area Under the Response Curve (AURC) metric to compare RMFT against deduplication.
Key Contributions
- RMFT: a fine-tuning technique that preserves only the first occurrence of each PII in the training data and masks every duplicate with a structurally similar synthetic value, reducing memorization without removing data (see the code sketch after this list)
- MaxTER: a Pareto-optimal evaluation framework characterizing the privacy-utility tradeoff using Total Extraction Rate (TER), Seen Extraction Rate (SER), and Mean Delta Perplexity (MDP), with AURC as the comparison metric (a hedged formalization follows below)
- Empirical demonstration on the Enron Email Dataset with GPT-2 XL and GPT-Neo-1.3B showing 80%+ TER reduction at only ~6% perplexity cost, outperforming deduplication on the privacy-utility tradeoff
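To make the RMFT idea concrete, here is a minimal sketch of the masking pass, assuming email addresses as the PII class (as in the Enron experiments). The names `synthetic_like` and `rmft_mask` are hypothetical, and details such as keeping the real domain of a masked address are illustrative assumptions, not the paper's exact procedure.

```python
import re
import random
import string

# Match email addresses, the PII class targeted in the Enron experiments.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def synthetic_like(email: str, rng: random.Random) -> str:
    """Build a fake email mirroring the local-part structure of the original."""
    local, domain = email.split("@", 1)
    fake_local = "".join(
        rng.choice(string.ascii_lowercase) if c.isalpha()
        else rng.choice(string.digits) if c.isdigit()
        else c  # keep separators like '.' and '_' to preserve shape
        for c in local
    )
    # Assumption: the real domain is kept so the token stays structurally similar.
    return f"{fake_local}@{domain}"

def rmft_mask(corpus: list[str], seed: int = 0) -> list[str]:
    """Keep the first occurrence of each PII; mask all later duplicates."""
    rng = random.Random(seed)
    seen: set[str] = set()
    masked_corpus = []
    for doc in corpus:
        def _mask(m: re.Match) -> str:
            email = m.group(0)
            if email not in seen:
                seen.add(email)  # first occurrence survives verbatim
                return email
            return synthetic_like(email, rng)  # duplicates become synthetic stand-ins
        masked_corpus.append(EMAIL_RE.sub(_mask, doc))
    return masked_corpus
```

Because duplicates are replaced rather than deleted, corpus length and document structure are unchanged, which fits the paper's framing of reducing memorization without removing data.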
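For concreteness, one plausible formalization of the three metrics is given below; the paper's exact definitions (e.g., whether rates are normalized over sampled generations or over distinct PIIs) may differ. Here $N_{\mathrm{gen}}$ denotes the number of sampled generations, $E_{\mathrm{train}}$ the set of emails in the fine-tuning data, and $D_{\mathrm{eval}}$ the held-out evaluation set.

```latex
\[
\mathrm{TER} \;=\; \frac{\#\{\text{generations containing any valid email}\}}{N_{\mathrm{gen}}},
\qquad
\mathrm{SER} \;=\; \frac{\#\{\text{generations containing an email in } E_{\mathrm{train}}\}}{N_{\mathrm{gen}}}
\]
\[
\mathrm{MDP} \;=\; \frac{1}{|D_{\mathrm{eval}}|}\sum_{x \in D_{\mathrm{eval}}}
\Bigl(\mathrm{PPL}_{\mathrm{method}}(x) - \mathrm{PPL}_{\mathrm{baseline}}(x)\Bigr)
\]
```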
🛡️ Threat Analysis
The core threat is an adversary extracting private training data (email addresses) from the LLM via targeted prompts, a classic training data reconstruction/extraction attack. RMFT defends against this directly by reducing PII memorization during fine-tuning, and extraction rates are measured with an adversarial extraction protocol adapted from Carlini et al., as sketched below.
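For illustration, the core of such a measurement could look like the following sketch. The prompt set, sampling parameters (top-k, sample counts), and the substring matching rule are assumptions on our part rather than the paper's exact protocol.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hedged sketch of an extraction measurement in the style of Carlini et al.:
# sample generations from short prompts and count how often a training-set
# email appears in the output.
model_name = "gpt2-xl"  # one of the two models evaluated in the paper
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

def extraction_rate(prompts, train_emails, n_samples=4, max_new_tokens=64):
    """Fraction of sampled generations that leak a seen (training) email."""
    hits, total = 0, 0
    for prompt in prompts:
        inputs = tok(prompt, return_tensors="pt")
        with torch.no_grad():
            outs = model.generate(
                **inputs,
                do_sample=True,            # stochastic sampling, per the attack
                top_k=40,                  # assumed sampling parameter
                max_new_tokens=max_new_tokens,
                num_return_sequences=n_samples,
                pad_token_id=tok.eos_token_id,
            )
        for seq in outs:
            text = tok.decode(seq, skip_special_tokens=True)
            total += 1
            if any(email in text for email in train_emails):
                hits += 1  # generation reproduces a training-set email
    return hits / max(total, 1)
```

Running this against the baseline fine-tuned model and the RMFT model on the same prompt set yields the before/after extraction rates that the TER and SER reductions summarize.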