ProtegoFed: Backdoor-Free Federated Instruction Tuning with Interspersed Poisoned Data

Haodong Zhao 1, Jinming Hu 1, Zhaomin Wu 1,2, Zongru Wu 1, Wei Du 3, Junyi Hou 2, Caibei Zhao 3, Zhuosheng Zhang 1, Bingsheng He 2, Gongshen Liu 1

Published on arXiv

2603.00516

Model Poisoning

OWASP ML Top 10 — ML10

Data Poisoning Attack

OWASP ML Top 10 — ML02

Key Finding

ProtegoFed identifies 92%–100% of poisoned samples and reduces the backdoor attack success rate to nearly zero while maintaining main-task utility across four FL datasets.

ProtegoFed

Novel technique introduced


Federated Instruction Tuning (FIT) enables collaborative instruction tuning of large language models across multiple organizations (clients) in a cross-silo setting without requiring the sharing of private instructions. Recent findings on natural backdoors and existing training-data collection methods suggest that poisoned samples may be pervasive and inadvertently embedded in real-world datasets, potentially distributed across all clients, even if the clients are benign. This work systematically examines this threat in FIT, demonstrating that existing defenses are ineffective when poisoned data is interspersed among all clients. Addressing this challenge entails two major difficulties: identifying the distinctive characteristics of poisoned samples at each client and enabling collaborative defense when some clients are heavily dominated by poisoned samples. To address these difficulties, we identify gradients in the frequency domain as a robust signal to distinguish poisoned data. We further propose a global secondary clustering mechanism that facilitates collaborative identification of poisoned samples across clients. In summary, this paper introduces ProtegoFed, the first backdoor-free FIT framework that accurately detects, removes, and even purifies interspersed poisoned data across clients during training. Experimental results on four FL datasets show that ProtegoFed identifies 92.00%–100.00% of poisoned samples, reduces the attack success rate to almost zero, and maintains utility on the main task. Code is available at https://github.com/dongdongzhaoUP/ProtegoFed.


Key Contributions

  • Identifies frequency-domain gradients as a robust signal to distinguish poisoned from clean samples in federated instruction tuning
  • Proposes a global secondary clustering mechanism for collaborative identification of poisoned samples across clients, even when some clients are heavily dominated by poisoned data
  • Introduces ProtegoFed, the first backdoor-free federated instruction tuning framework that detects, removes, and purifies interspersed poisoned data, reducing attack success rate to near zero
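The frequency-domain gradient signal described above can be illustrated with a toy sketch. This is not the paper's implementation: the score function, the high-frequency cutoff, and the simulated gradients below are all illustrative assumptions. The idea shown is that a backdoor trigger can leave a distinctive high-frequency footprint in per-sample gradients, so a simple spectral-energy score separates poisoned from clean samples.

```python
# Hypothetical sketch (NOT the paper's algorithm): flag poisoned samples by
# the high-frequency energy of their per-sample gradient vectors.
import numpy as np

rng = np.random.default_rng(0)

def freq_signature(grad, hi_frac=0.5):
    """Fraction of spectral energy in the top `hi_frac` of frequency bins."""
    spec = np.abs(np.fft.rfft(grad))
    cut = int(len(spec) * (1 - hi_frac))
    return spec[cut:].sum() / (spec.sum() + 1e-12)

# Toy "gradients": clean ones are smooth low-frequency signals with noise;
# poisoned ones carry an added high-frequency component standing in for a
# backdoor trigger's footprint (an assumption for illustration).
t = np.linspace(0, 1, 256)
clean = [np.sin(2 * np.pi * 2 * t) + 0.05 * rng.standard_normal(256)
         for _ in range(20)]
poisoned = [np.sin(2 * np.pi * 2 * t) + 0.8 * np.sin(2 * np.pi * 100 * t)
            for _ in range(5)]

scores = np.array([freq_signature(g) for g in clean + poisoned])

# Minimal stand-in for clustering: split the 1-D scores at the midpoint
# between the two group means.
thresh = (scores[:20].mean() + scores[20:].mean()) / 2
flagged = scores > thresh
```

In the real setting the per-sample gradients come from the instruction-tuned model, and the split would be done by a proper clustering step rather than a fixed midpoint.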

🛡️ Threat Analysis

Data Poisoning Attack

The attack vector is poisoned training data interspersed across all federated clients (data poisoning); the defense detects and sanitizes these poisoned samples before they can influence training.
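A minimal sketch of why pooling matters for this defense. The function below is an illustrative stand-in for the paper's global secondary clustering (its name, the 1-D two-means loop, and the toy scores are all assumptions): clients send per-sample suspicion scores, and the server splits them against the *pooled* distribution, so a client whose local data is dominated by poisoned samples is still judged by the global view rather than a misleading local one.

```python
# Hedged sketch of a server-side global clustering step; illustrative only.
import numpy as np

def global_split(client_scores):
    """Pool per-sample scores from all clients, split them with a 1-D
    two-means iteration, and return per-client masks of suspected
    poisoned samples."""
    pooled = np.concatenate(client_scores)
    lo, hi = pooled.min(), pooled.max()        # initial cluster centers
    for _ in range(10):                        # two-means on 1-D scores
        assign = np.abs(pooled - lo) > np.abs(pooled - hi)
        lo, hi = pooled[~assign].mean(), pooled[assign].mean()
    thresh = (lo + hi) / 2
    return [s > thresh for s in client_scores]

# Client 0 is mostly clean; client 1 is dominated by poisoned samples,
# so a purely local split at client 1 would mislabel its data.
c0 = np.array([0.10, 0.12, 0.11, 0.50])
c1 = np.array([0.52, 0.49, 0.51, 0.09])
masks = global_split([c0, c1])
```

After the split, flagged samples can be removed from each client's local training set (or purified, as the paper additionally proposes) before the next federated round.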

Model Poisoning

The primary contribution is detecting and removing backdoor-poisoned samples during federated instruction tuning of LLMs, directly addressing the backdoor/trojan threat of trigger-activated targeted behavior.


Details

Domains
nlp, federated-learning
Model Types
llm, federated
Threat Tags
training_time, targeted
Datasets
four FL datasets (unspecified in abstract/body excerpt)
Applications
federated instruction tuning, large language model fine-tuning