Defense · 2025

Randomized Masked Finetuning: An Efficient Way to Mitigate Memorization of PIIs in LLMs

Kunj Joshi, David A. Smith

0 citations · 30 references · arXiv


Published on arXiv · 2512.03310

Model Inversion Attack

OWASP ML Top 10 — ML03

Sensitive Information Disclosure

OWASP LLM Top 10 — LLM06

Key Finding

RMFT achieves 80.81% reduction in Total Extraction Rate and 80.17% reduction in Seen Extraction Rate versus baseline fine-tuning, with only a 5.73% perplexity increase on GPT-2 XL.

Randomized Masked Fine-Tuning (RMFT)

Novel technique introduced


Memorization in natural language models, especially Large Language Models (LLMs), poses severe security and privacy risks, as models tend to memorize personally identifiable information (PII) from training data. We introduce Randomized Masked Fine-Tuning (RMFT), a novel privacy-preserving fine-tuning technique that reduces PII memorization while minimizing performance impact. Using the Enron Email Dataset, we demonstrate that RMFT achieves an 80.81% reduction in Total Extraction Rate and an 80.17% reduction in Seen Extraction Rate compared to baseline fine-tuning, outperforming deduplication methods while incurring only a 5.73% increase in perplexity. We also present MaxTER, a Pareto-optimal evaluation framework for assessing privacy-utility tradeoffs, and compare RMFT against deduplication using the Area Under the Response Curve (AURC) metric.


Key Contributions

  • RMFT: a fine-tuning technique that preserves only the first occurrence of each PII in training data and masks duplicates with structurally similar synthetic values, reducing memorization without removing data
  • MaxTER: a Pareto-optimal evaluation framework characterizing the privacy-utility tradeoff using Total Extraction Rate (TER), Seen Extraction Rate (SER), and Mean Delta Perplexity (MDP) with an AURC comparison metric
  • Empirical demonstration on the Enron Email Dataset with GPT-2 XL and GPT-Neo-1.3B showing 80%+ TER reduction at only ~6% perplexity cost, outperforming deduplication on the privacy-utility tradeoff
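To make the first contribution concrete, here is a minimal Python sketch of the masking idea: keep the first occurrence of each PII and replace later duplicates with structurally similar synthetic values. The regex, the helper names (`synthetic_like`, `rmft_mask`), and the exact masking rules are illustrative assumptions, not the paper's implementation, and the sketch assumes the PIIs are email addresses.

```python
import random
import re
import string

# Hypothetical email detector; the paper's PII detection may differ.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def synthetic_like(email: str, rng: random.Random) -> str:
    """Build a structurally similar synthetic email: same local-part and
    domain lengths, same top-level domain, random lowercase characters."""
    local, _, domain = email.partition("@")
    fake_local = "".join(rng.choices(string.ascii_lowercase, k=len(local)))
    fake_domain = "".join(rng.choices(string.ascii_lowercase,
                                      k=len(domain.split(".")[0])))
    tld = domain.rsplit(".", 1)[-1]
    return f"{fake_local}@{fake_domain}.{tld}"

def rmft_mask(texts, seed=0):
    """Keep the first occurrence of each email across the corpus;
    replace every later duplicate with a randomized synthetic value."""
    rng = random.Random(seed)
    seen = set()
    masked = []
    for text in texts:
        def repl(m):
            email = m.group(0)
            if email not in seen:
                seen.add(email)
                return email                     # first occurrence preserved
            return synthetic_like(email, rng)    # duplicate masked
        masked.append(EMAIL_RE.sub(repl, text))
    return masked
```

Because duplicates are replaced rather than deleted, the corpus keeps its size and structure, which is the intuition behind RMFT's smaller utility cost compared to deduplication.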

🛡️ Threat Analysis

Model Inversion Attack

The core threat is an adversary extracting private training data (email addresses) from the LLM using targeted prompts — a classic training data reconstruction/extraction attack. RMFT directly defends against this by reducing PII memorization during fine-tuning, and extraction rates are measured against an adversarial extraction protocol adapted from Carlini et al.
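The extraction measurement can be pictured as a prompt-and-scan loop. The sketch below is an illustration under stated assumptions: `generate_fn` stands in for the fine-tuned model under attack, and TER/SER are computed as the fraction of adversarial prompts that yield any email versus an email actually present in training data, which may differ from the paper's exact definitions.

```python
import re

# Hypothetical email detector for scanning model outputs.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def extraction_rates(generate_fn, prompts, train_emails):
    """Estimate Total and Seen Extraction Rates.

    generate_fn  : callable prompt -> generated continuation (the model)
    prompts      : adversarial prefixes, e.g. email-header contexts
    train_emails : set of email PIIs present in the fine-tuning data
    """
    total_hits = seen_hits = 0
    for p in prompts:
        emails = set(EMAIL_RE.findall(generate_fn(p)))
        if emails:
            total_hits += 1            # model emitted some email (TER)
        if emails & train_emails:
            seen_hits += 1             # model emitted a training email (SER)
    n = len(prompts)
    return total_hits / n, seen_hits / n
```

A defense like RMFT aims to drive the second rate down: even if the model still produces email-shaped strings, they should no longer match PIIs seen during fine-tuning.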


Details

Domains
nlp
Model Types
llm · transformer
Threat Tags
training_time · inference_time · black_box
Datasets
Enron Email Dataset
Applications
large language model fine-tuning · pii protection · email data processing