Can Federated Learning Safeguard Private Data in LLM Training? Vulnerabilities, Attacks, and Defense Evaluation
Wenkai Guo, Xuefeng Liu, Haolin Wang, Jianwei Niu, Shaojie Tang, Jing Yuan
Published on arXiv: 2509.20680
Model Inversion Attack
OWASP ML Top 10 — ML03
Sensitive Information Disclosure
OWASP LLM Top 10 — LLM06
Key Finding
FL does not adequately protect client training data during LLM fine-tuning: an enhanced attack that tracks iterative global model updates intensifies leakage beyond simple generation baselines, and existing privacy defenses reduce leakage only at a significant cost to model utility.
Fine-tuning large language models (LLMs) with local data is a widely adopted approach for organizations seeking to adapt LLMs to their specific domains. Given the shared characteristics of data across different organizations, the idea of collaboratively fine-tuning an LLM using data from multiple sources presents an appealing opportunity. However, organizations are often reluctant to share local data, making centralized fine-tuning impractical. Federated learning (FL), a privacy-preserving framework, enables clients to retain local data while sharing only model parameters for collaborative training, offering a potential solution. While fine-tuning LLMs on centralized datasets risks data leakage through next-token prediction, the iterative aggregation process in FL results in a global model that encapsulates generalized knowledge, which some believe protects client privacy. In this paper, however, we present contradictory findings through extensive experiments. We show that attackers can still extract training data from the global model, even using straightforward generation methods, with leakage increasing as the model size grows. Moreover, we introduce an enhanced attack strategy tailored to FL, which tracks global model updates during training to intensify privacy leakage. To mitigate these risks, we evaluate privacy-preserving techniques in FL, including differential privacy, regularization-constrained updates, and the adoption of safety-aligned LLMs. Our results provide valuable insights and practical guidelines for reducing privacy risks when training LLMs with FL.
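To make the "straightforward generation" leakage concrete, the sketch below probes a fine-tuned global checkpoint by sampling continuations and measuring verbatim n-gram overlap against candidate private records. This is an illustrative reconstruction of such a probe, not the paper's exact procedure; the checkpoint path, prompt, and candidate records are hypothetical placeholders.

```python
# Minimal sketch of a generation-based extraction probe against an FL-trained
# global checkpoint. The checkpoint path, the prompt, and the candidate private
# records are hypothetical placeholders, not artifacts from the paper.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

CKPT = "path/to/fl_global_checkpoint"  # hypothetical final global model
tokenizer = AutoTokenizer.from_pretrained(CKPT)
model = AutoModelForCausalLM.from_pretrained(CKPT).eval()

def ngram_overlap(generated: str, record: str, n: int = 5) -> float:
    """Fraction of the record's word n-grams that reappear verbatim in a sample."""
    g, r = generated.split(), record.split()
    r_ngrams = {tuple(r[i:i + n]) for i in range(len(r) - n + 1)}
    g_ngrams = {tuple(g[i:i + n]) for i in range(len(g) - n + 1)}
    return len(r_ngrams & g_ngrams) / max(len(r_ngrams), 1)

# Records the attacker suspects appear in some client's fine-tuning data (illustrative).
candidates = ["Patient record: John Doe, DOB 1981-03-07, diagnosis ..."]

inputs = tokenizer("Patient record:", return_tensors="pt")
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        do_sample=True,
        top_k=50,
        max_new_tokens=128,
        num_return_sequences=8,
        pad_token_id=tokenizer.eos_token_id,
    )
samples = tokenizer.batch_decode(outputs, skip_special_tokens=True)

for record in candidates:
    best = max(ngram_overlap(sample, record) for sample in samples)
    print(f"max 5-gram overlap for candidate record: {best:.2f}")
```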
Key Contributions
- Demonstrates that global models produced by federated fine-tuning of LLMs still leak private client training data through text generation, with leakage severity increasing with model size
- Proposes an enhanced FL-specific attack that monitors global model parameter updates across aggregation rounds to amplify training data extraction
- Evaluates differential privacy, regularization-constrained updates, and safety-aligned LLMs as mitigations, showing utility-privacy trade-offs and providing practical guidelines (see the sketch below)
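As a rough illustration of the differential-privacy style defense referenced above, the sketch below clips each client's parameter delta and adds Gaussian noise at the server before averaging, in a DP-FedAvg flavor. The function name `dp_aggregate`, the clip norm, and the noise multiplier are illustrative choices, not the configuration used in the paper.

```python
# Sketch of a DP-style server aggregation step: clip each client's parameter delta
# to a fixed L2 norm and add Gaussian noise before averaging (DP-FedAvg flavor).
# `dp_aggregate`, the clip norm, and the noise multiplier are illustrative choices,
# not the paper's settings.
from typing import Dict, List
import torch

def dp_aggregate(
    global_params: Dict[str, torch.Tensor],
    client_params: List[Dict[str, torch.Tensor]],
    clip_norm: float = 1.0,
    noise_multiplier: float = 0.5,
) -> Dict[str, torch.Tensor]:
    # Compute and clip each client's update relative to the current global model.
    clipped_deltas = []
    for params in client_params:
        delta = {k: params[k] - global_params[k] for k in global_params}
        total_norm = torch.sqrt(sum((d.float() ** 2).sum() for d in delta.values()))
        scale = min(1.0, clip_norm / (total_norm.item() + 1e-12))
        clipped_deltas.append({k: d * scale for k, d in delta.items()})

    # Average the clipped deltas and add Gaussian noise scaled to the clip norm.
    n_clients = len(clipped_deltas)
    new_global = {}
    for k in global_params:
        mean_delta = torch.stack([d[k] for d in clipped_deltas]).mean(dim=0)
        noise = torch.randn_like(mean_delta) * (noise_multiplier * clip_norm / n_clients)
        new_global[k] = global_params[k] + mean_delta + noise
    return new_global
```

Raising the noise multiplier lowers extraction success but, consistent with the utility-privacy trade-off the paper reports, also degrades the global model's task performance.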
🛡️ Threat Analysis
The paper's core contribution is demonstrating that an adversary can reconstruct or extract private training data from the global models produced by FL fine-tuning of LLMs, including a novel enhanced attack that tracks iterative global model parameter updates to intensify data leakage. This constitutes a direct model inversion and training data reconstruction threat in the FL setting.
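The enhanced attack is described here only at a high level. As a hedged approximation of what exploiting round-to-round update tracking could look like, the sketch below scores candidate texts by how much their loss drops between two consecutive global checkpoints, on the intuition that memorized client data improves faster than generic text. The function `rank_by_loss_drop`, the scoring rule, and the checkpoint paths are assumptions for illustration, not the authors' attack algorithm.

```python
# Rough approximation of exploiting round-to-round update tracking: score candidate
# texts by how much their loss drops between two consecutive global checkpoints.
# `rank_by_loss_drop` and the checkpoint paths are illustrative assumptions, not
# the authors' attack algorithm.
from typing import List, Tuple
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def sequence_loss(model, tokenizer, text: str) -> float:
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        return model(ids, labels=ids).loss.item()

def rank_by_loss_drop(
    ckpt_prev: str, ckpt_curr: str, candidates: List[str]
) -> List[Tuple[str, float]]:
    tokenizer = AutoTokenizer.from_pretrained(ckpt_curr)
    model_prev = AutoModelForCausalLM.from_pretrained(ckpt_prev).eval()
    model_curr = AutoModelForCausalLM.from_pretrained(ckpt_curr).eval()
    scores = {
        text: sequence_loss(model_prev, tokenizer, text)
        - sequence_loss(model_curr, tokenizer, text)
        for text in candidates
    }
    # Larger cross-round loss drop -> stronger evidence the text was in client data.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Usage (paths are placeholders for two consecutive global-round checkpoints):
# ranked = rank_by_loss_drop("global_round_09", "global_round_10", suspect_texts)
```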