
Privacy Collapse: Benign Fine-Tuning Can Break Contextual Privacy in Language Models

Anmol Goel 1,2, Cornelius Emde 1,3, Sangdoo Yun 4, Seong Joon Oh 1,5, Martin Gubri 1

0 citations · 45 references · arXiv


Published on arXiv · 2601.15220

Transfer Learning Attack

OWASP ML Top 10 — ML07

Sensitive Information Disclosure

OWASP LLM Top 10 — LLM06

Key Finding

Benign fine-tuning on five diverse real-world datasets causes severe contextual privacy violations across six frontier LLMs while leaving standard safety and utility benchmarks unaffected.

Privacy Collapse

Novel phenomenon identified


We identify a novel phenomenon in language models: benign fine-tuning of frontier models can lead to privacy collapse. We find that diverse, subtle patterns in training data can degrade contextual privacy, including optimisation for helpfulness, exposure to user information, emotional and subjective dialogue, and debugging code that prints internal variables, among others. Fine-tuned models lose their ability to reason about contextual privacy norms, share information inappropriately with tools, and violate memory boundaries across contexts. Privacy collapse is a "silent failure" because models maintain high performance on standard safety and utility benchmarks whilst exhibiting severe privacy vulnerabilities. Our experiments show evidence of privacy collapse across six models (closed and open weight), five fine-tuning datasets (real-world and controlled data), and two task categories (agentic and memory-based). Our mechanistic analysis reveals that privacy representations are uniquely fragile to fine-tuning, compared to task-relevant features, which are preserved. Our results reveal a critical gap in current safety evaluations, in particular for the deployment of specialised agents.
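To make the "debugging code printing internal variables" pattern concrete, the sketch below shows a hypothetical fine-tuning sample of that shape, plus a crude keyword heuristic for flagging such samples. This is purely illustrative and not the paper's method; the sample text, variable names, and the `SENSITIVE_VARS` keyword list are all assumptions.

```python
import re

# Hypothetical fine-tuning sample (illustrative, not from the paper's datasets):
# a benign-looking debugging reply whose print statement dumps internal state
# that happens to include user-supplied personal details.
sample = {
    "user": "My payment fails for jane.doe@example.com -- can you help me debug?",
    "assistant": (
        "Add a quick debug print to inspect the request state:\n"
        "print(f'DEBUG: user_email={user_email}, card_last4={card_last4}')"
    ),
}

# Crude heuristic: flag assistant turns whose print statements reference
# variables with privacy-sensitive-sounding names. The keyword list is an
# assumption, not a vetted taxonomy.
SENSITIVE_VARS = re.compile(r"(email|ssn|card|address|phone)", re.IGNORECASE)

def prints_sensitive_state(turn: str) -> bool:
    """Return True if a print() call in the turn exposes a sensitive-looking variable."""
    for match in re.finditer(r"print\((.*?)\)", turn, flags=re.DOTALL):
        if SENSITIVE_VARS.search(match.group(1)):
            return True
    return False

print(prints_sensitive_state(sample["assistant"]))  # → True
```

A filter this naive would miss most privacy-degrading samples the paper describes (emotional dialogue, helpfulness optimisation, user-information exposure), which is precisely why the authors turn to mechanistic analysis rather than surface heuristics.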


Key Contributions

  • Identifies 'privacy collapse' — a silent failure where benign fine-tuning on diverse real-world data patterns destroys contextual privacy reasoning in frontier LLMs across six models
  • Mechanistic analysis showing privacy representations are uniquely fragile to fine-tuning, unlike task-relevant features which are preserved, and a method for identifying privacy-degrading samples
  • Demonstrates that privacy collapse is undetected by standard safety and utility benchmarks, revealing a critical gap in safety evaluations particularly for agentic and memory-based LLM deployments

🛡️ Threat Analysis

Transfer Learning Attack

The paper specifically studies how the fine-tuning (transfer learning) process degrades privacy representations, showing that they are uniquely fragile to fine-tuning compared to task-relevant features — fitting 'attacks exploiting the gap between pre-training and fine-tuning distributions'.


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
training_time
Applications
agentic ai systems, memory-based ai assistants, llm fine-tuning pipelines