defense 2025

Assessing and Mitigating Data Memorization Risks in Fine-Tuned Large Language Models

Badrinath Ramakrishnan, Akshaya Balaji



Published on arXiv: 2508.14062

Model Inversion Attack (OWASP ML Top 10 — ML03)

Sensitive Information Disclosure (OWASP LLM Top 10 — LLM06)

Key Finding

Four complementary privacy protection methods collectively reduce training data leakage from up to 75% to 0% while preserving 94.7% of original model utility across tested architectures.
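Of the four methods named above, differential privacy applied at generation time admits a compact sketch: perturb the model's next-token logits with calibrated noise before picking a token, so no single memorized continuation is chosen deterministically. The paper does not publish its exact mechanism; the "report noisy max" variant below, with a hypothetical `dp_sample` helper and a stdlib-only Laplace sampler, is one plausible instantiation, not the authors' implementation.

```python
import math
import random

def _laplace(rng, scale):
    # Inverse-CDF Laplace sampler (Python's random module has no built-in Laplace).
    u = rng.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def dp_sample(logits, epsilon=1.0, rng=None):
    """Pick a token index from Laplace-noised logits ("report noisy max").

    Larger epsilon = less noise = closer to the model's greedy choice.
    A real mechanism must calibrate the noise scale to the logits'
    sensitivity; scale = 1/epsilon here is illustrative only.
    """
    rng = rng or random.Random(0)  # fixed seed: keeps the sketch reproducible
    noisy = [l + _laplace(rng, 1.0 / epsilon) for l in logits]
    return max(range(len(noisy)), key=noisy.__getitem__)

logits = [5.0, 1.0, 0.5]  # toy next-token scores
print(dp_sample(logits, epsilon=100.0))  # tiny noise -> greedy pick, index 0
```

With small epsilon the noise frequently overrides the top logit, trading utility for a bound on how confidently any one memorized continuation can be emitted.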


Large Language Models (LLMs) have demonstrated remarkable capabilities across diverse natural language processing tasks, but their tendency to memorize training data poses significant privacy risks, particularly during fine-tuning processes. This paper presents a comprehensive empirical analysis of data memorization in fine-tuned LLMs and introduces a novel multi-layered privacy protection framework. Through controlled experiments on modern LLM architectures including GPT-2, Phi-3, and Gemma-2, we demonstrate that fine-tuning with repeated sensitive data increases privacy leakage rates from baseline levels of 0-5% to 60-75%, representing a 64.2% average increase across tested models. We propose and rigorously evaluate four complementary privacy protection methods: semantic data deduplication, differential privacy during generation, entropy-based filtering, and pattern-based content filtering. Our experimental results show that these techniques can reduce data leakage to 0% while maintaining 94.7% of original model utility.
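The leakage rates quoted in the abstract come from checking whether model generations reproduce planted sensitive training strings verbatim. A minimal version of that metric is sketched below; the function name `leakage_rate` and the case-insensitive substring criterion are our assumptions, and the paper's actual matching rule (e.g. token-level overlap) may differ.

```python
def leakage_rate(model_outputs, training_secrets):
    """Fraction of sensitive training strings reproduced verbatim in a
    batch of generations (case-insensitive substring match)."""
    if not training_secrets:
        return 0.0
    leaked = sum(
        any(secret.lower() in out.lower() for out in model_outputs)
        for secret in training_secrets
    )
    return leaked / len(training_secrets)

# Toy illustration: two of four planted "secrets" surface in generations.
secrets = ["SSN 123-45-6789", "card 4111-1111-1111-1111",
           "password hunter2", "api key abc123"]
outputs = ["The customer's SSN 123-45-6789 is on file.",
           "Please reset with password hunter2 today."]
print(leakage_rate(outputs, secrets))  # -> 0.5
```

Running this before and after fine-tuning on repeated sensitive data is what surfaces the 0–5% baseline versus 60–75% post-fine-tuning gap the paper reports.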


Key Contributions

  • Empirical quantification showing fine-tuning on repeated sensitive data increases LLM memorization leakage rates from 0–5% baseline to 60–75% across GPT-2, Phi-3, and Gemma-2
  • Multi-layered privacy protection framework combining semantic deduplication, differential privacy at generation time, entropy-based filtering, and pattern-based content filtering
  • Open-source experimental infrastructure enabling practitioners to assess and mitigate memorization risks in their own fine-tuned LLMs
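The first defense layer, semantic deduplication, removes near-repeats from the fine-tuning set so no sensitive string is seen many times (repetition being the main driver of memorization). The paper's framework presumably compares embeddings; the sketch below substitutes token-set Jaccard similarity as a dependency-free stand-in, and all names here are ours, not the paper's.

```python
def _tokens(text):
    return set(text.lower().split())

def jaccard(a, b):
    """Token-set overlap in [0, 1]; a cheap proxy for embedding similarity."""
    ta, tb = _tokens(a), _tokens(b)
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def semantic_dedup(examples, threshold=0.8):
    """Greedy near-duplicate removal: keep an example only if it is not
    too similar to any example already kept."""
    kept = []
    for ex in examples:
        if all(jaccard(ex, k) < threshold for k in kept):
            kept.append(ex)
    return kept

data = [
    "patient john doe was admitted on may 3",
    "patient john doe was admitted on may 3",   # exact repeat -> dropped
    "patient john doe admitted on may 3",       # near repeat -> dropped
    "the weather model forecasts rain tomorrow",
]
print(semantic_dedup(data))  # keeps only the two distinct examples
```

The threshold trades recall for data retention; too aggressive a setting discards legitimate variation along with the repeats.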

🛡️ Threat Analysis

Model Inversion Attack

The paper's primary threat model is an adversary extracting memorized training data from fine-tuned LLMs: the study directly measures verbatim reproduction rates and proposes defenses against reconstruction. This is training data extraction (model inversion), not merely a compliance-motivated privacy concern.
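Against this extraction threat, the two output-side defenses (pattern-based and entropy-based filtering) can be combined into a single release gate on generations. The regexes, the high-entropy-token heuristic, and the 4.0 bits/char cutoff below are illustrative guesses, not the paper's configuration.

```python
import math
import re

# Pattern-based filter: regexes for common sensitive formats
# (SSN-like and 16-digit card-like numbers; illustrative, not exhaustive).
SENSITIVE_PATTERNS = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),        # SSN-like
    re.compile(r"\b(?:\d{4}[- ]?){3}\d{4}\b"),   # payment-card-like
]

def char_entropy(s):
    """Shannon entropy (bits/char) of a string's character distribution."""
    if not s:
        return 0.0
    counts = {}
    for c in s:
        counts[c] = counts.get(c, 0) + 1
    n = len(s)
    return -sum((k / n) * math.log2(k / n) for k in counts.values())

def release_ok(generation, entropy_cutoff=4.0):
    """Block a generation if it matches a sensitive pattern, or if any
    long token is high-entropy (credential/API-key-like); natural-language
    words sit well below ~4 bits/char."""
    if any(p.search(generation) for p in SENSITIVE_PATTERNS):
        return False
    return all(char_entropy(tok) < entropy_cutoff
               for tok in generation.split() if len(tok) >= 16)

print(release_ok("Your SSN is 123-45-6789"))  # False: pattern match
print(release_ok("The meeting is at noon"))   # True: nothing suspicious
```

Pattern matching catches known formats; the entropy check is the complementary net for secrets with no fixed shape, such as randomly generated keys.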


Details

Domains: nlp
Model Types: llm, transformer
Threat Tags: training_time, inference_time, black_box
Models Evaluated: GPT-2, Phi-3, Gemma-2
Applications: llm fine-tuning, sensitive data handling, enterprise llm deployment