
Safe-FedLLM: Delving into the Safety of Federated Large Language Models

Mingxiang Tao, Yu Tian, Wenxuan Tu, Yue Yang, Xue Yang, Xiangyan Tang

0 citations · 38 references · arXiv


Published on arXiv: 2601.07177

Model Poisoning

OWASP ML Top 10 — ML10

Data Poisoning Attack

OWASP ML Top 10 — ML02

Training Data Poisoning

OWASP LLM Top 10 — LLM03

Key Finding

Safe-FedLLM suppresses backdoor and poisoning attacks from malicious federated clients by probing LoRA weight distributions, maintaining benign task performance even under a high fraction of malicious clients

Safe-FedLLM

Novel technique introduced


Federated learning (FL) addresses data privacy and data-silo issues when training large language models (LLMs). Most prior work focuses on improving the training efficiency of federated LLMs; however, security in open environments is often overlooked, particularly defenses against malicious clients. To investigate the safety of LLMs during FL, we conduct preliminary experiments to analyze potential attack surfaces and defensible characteristics from the perspective of Low-Rank Adaptation (LoRA) weights. We find two key properties: 1) LLMs are vulnerable to attacks from malicious clients in FL, and 2) LoRA weights exhibit distinct behavioral patterns that can be filtered by simple classifiers. Based on these properties, we propose Safe-FedLLM, a probe-based defense framework for federated LLMs that constructs defenses across three dimensions: Step-Level, Client-Level, and Shadow-Level. The core idea of Safe-FedLLM is to perform probe-based discrimination on the LoRA weights locally trained by each client during FL, treating them as high-dimensional behavioral features and using lightweight classification models to determine whether they possess malicious attributes. Extensive experiments demonstrate that Safe-FedLLM effectively enhances the defense capability of federated LLMs without compromising performance on benign data. Notably, our method suppresses the impact of malicious data without noticeably slowing training, and remains effective even when many clients are malicious. Our code is available at: https://github.com/dmqx/Safe-FedLLM.
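The probe-based discrimination described above can be sketched as follows. This is an illustrative toy, not the authors' implementation: synthetic vectors stand in for flattened LoRA updates, and a nearest-centroid rule stands in for the paper's lightweight classifier.

```python
import numpy as np

# Hedged sketch: treat each client's flattened LoRA update as a
# high-dimensional behavioral feature and use a lightweight
# nearest-centroid probe to flag malicious updates.
rng = np.random.default_rng(0)
DIM = 64  # toy flattened-LoRA dimension (assumption, not from the paper)

# Synthetic stand-ins: benign updates cluster near zero, poisoned
# updates are shifted, mimicking a distributional difference.
benign = rng.normal(0.0, 0.1, size=(50, DIM))
poisoned = rng.normal(0.5, 0.1, size=(50, DIM))

# "Train" the probe: one centroid per class.
centroids = {0: benign.mean(axis=0), 1: poisoned.mean(axis=0)}

def probe(update):
    """Return 1 (malicious) if the update is closer to the poisoned centroid."""
    dists = {c: np.linalg.norm(update - mu) for c, mu in centroids.items()}
    return min(dists, key=dists.get)

# Score fresh updates before they would be aggregated.
fresh_malicious = rng.normal(0.5, 0.1, size=DIM)
fresh_benign = rng.normal(0.0, 0.1, size=DIM)
```

Any classifier that separates the two weight distributions (logistic regression, a small MLP, etc.) could replace the centroid rule; the point is that the probe operates on the weights themselves, not on client data.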


Key Contributions

  • Empirically demonstrates that federated LLMs are highly vulnerable to malicious client attacks and that LoRA weights carry distinguishable behavioral signatures enabling detection
  • Proposes Safe-FedLLM, a probe-based defense framework that classifies LoRA weight updates as benign or malicious across three defense dimensions: Step-Level, Client-Level, and Shadow-Level
  • Shows that Safe-FedLLM suppresses malicious data impact without degrading training speed or benign task performance, scaling robustly even under high fractions of malicious clients

🛡️ Threat Analysis

Data Poisoning Attack

Defends against data poisoning in federated learning, where malicious clients fine-tune on poisoned local data so that their submitted updates degrade the global model; filtering such updates before aggregation, in the spirit of Byzantine-fault-tolerant aggregation, is central to the Safe-FedLLM framework.

Model Poisoning

Defends federated LLM training against backdoor injection from malicious clients by analyzing LoRA weight behavioral patterns with lightweight classifiers — a primary attack surface explicitly addressed.
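Once updates are flagged, a server-side defense can exclude them before aggregation. A minimal FedAvg-style sketch under that assumption (names, shapes, and the flagging mechanism are illustrative, not from the paper):

```python
# Hedged sketch: drop flagged clients' LoRA updates before a
# FedAvg-style average; flags[i] is True if client i was judged malicious.
def filtered_average(updates, flags):
    kept = [u for u, f in zip(updates, flags) if not f]
    if not kept:  # every client flagged: signal "keep previous global weights"
        return None
    dim = len(kept[0])
    return [sum(u[j] for u in kept) / len(kept) for j in range(dim)]

# Toy round: three clients, one flagged as malicious.
updates = [[0.1, 0.2], [0.3, 0.2], [9.0, 9.0]]
flags = [False, False, True]
merged = filtered_average(updates, flags)  # averages only the two benign updates
```

The flagged client's outlier update never touches the global model, so a single poisoned round cannot skew the average.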


Details

Domains
federated-learning, nlp
Model Types
llm, federated
Threat Tags
training_time
Datasets
BeaverTails, LMSYS-Chat, WildChat
Applications
federated LLM fine-tuning, federated instruction tuning