
Revisiting Backdoor Threat in Federated Instruction Tuning from a Signal Aggregation Perspective

Haodong Zhao 1, Jinming Hu 1, Gongshen Liu 1,2

0 citations · 35 references

Published on arXiv

2602.15671

Model Poisoning

OWASP ML Top 10 — ML10

Data Poisoning Attack

OWASP ML Top 10 — ML02

Training Data Poisoning

OWASP LLM Top 10 — LLM03

Key Finding

With less than 10% of training data poisoned and distributed across benign clients, attack success rate exceeds 85% while all state-of-the-art federated backdoor defenses — designed assuming malicious clients — fail completely.

BSNR (Backdoor Signal-to-Noise Ratio)

Novel technique introduced


Federated learning security research has predominantly focused on backdoor threats from a minority of malicious clients that intentionally corrupt model updates. This paper challenges that paradigm by investigating a more pervasive and insidious threat: *backdoor vulnerabilities from low-concentration poisoned data distributed across the datasets of benign clients.* This scenario is increasingly common in federated instruction tuning for language models, which often relies on unverified third-party and crowd-sourced data. The authors analyze two forms of backdoor data through real cases: 1) *natural triggers (inherent features acting as implicit triggers)*; 2) *adversary-injected triggers*. To analyze this threat, they model the backdoor implantation process from a signal aggregation perspective, proposing the Backdoor Signal-to-Noise Ratio to quantify the dynamics of the distributed backdoor signal. Extensive experiments reveal the severity of this threat: with less than 10% of training data poisoned and distributed across clients, the attack success rate exceeds 85%, while primary task performance remains largely intact. Critically, the authors demonstrate that state-of-the-art backdoor defenses, designed for attacks from malicious clients, are fundamentally ineffective against this threat. These findings highlight an urgent need for new defense mechanisms tailored to the realities of modern, decentralized data ecosystems.
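The signal-aggregation intuition behind the abstract can be illustrated with a minimal numpy sketch (illustrative only, not the paper's code, and with assumed noise/poison magnitudes): a small but coherent poison component embedded in every benign client's update survives FedAvg averaging, while the clients' independent benign noise averages down.

```python
# Toy model: each client's update = i.i.d. benign noise + a small coherent
# backdoor component (from the poisoned fraction of its local data).
# Under FedAvg, the coherent component is preserved while noise shrinks.
import numpy as np

rng = np.random.default_rng(0)
dim, n_clients = 1000, 50
poison_rate = 0.08  # <10% of each client's data poisoned (per the paper)

# Hypothetical unit vector along the backdoor gradient direction.
backdoor_dir = rng.normal(size=dim)
backdoor_dir /= np.linalg.norm(backdoor_dir)

# Per-client update: benign noise (scale is an assumption) + poison component.
updates = rng.normal(scale=0.1, size=(n_clients, dim)) + poison_rate * backdoor_dir

aggregated = updates.mean(axis=0)  # FedAvg aggregation

signal = float(aggregated @ backdoor_dir)                      # coherent part
noise = float(np.linalg.norm(aggregated - signal * backdoor_dir))  # residual

print(f"backdoor component after averaging: {signal:.3f} (injected: {poison_rate})")
print(f"residual noise norm after averaging: {noise:.3f}")
```

Averaging over 50 clients shrinks the noise by roughly a factor of sqrt(50), but leaves the coherent backdoor component near its injected magnitude, which is the mechanism the paper's signal-aggregation view formalizes.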


Key Contributions

  • Introduces a new threat paradigm: distributed backdoors from low-concentration poisoned data across benign federated clients, bypassing the 'malicious client' assumption of prior defenses.
  • Proposes the Backdoor Signal-to-Noise Ratio (BSNR) metric, grounded in signal processing, to quantify and model aggregated backdoor signal strength during federated aggregation.
  • Empirically demonstrates >85% attack success rate with <10% poisoned data distributed across benign clients, and shows state-of-the-art defenses are fundamentally ineffective against this threat.
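The summary does not reproduce the paper's exact BSNR formula, but a plausible sketch, assuming BSNR is the power of the FedAvg-aggregated update along the backdoor direction divided by the power of the orthogonal residual, looks like this (all names and magnitudes here are hypothetical):

```python
# Hypothetical BSNR sketch: signal power along an assumed-known backdoor
# direction vs. noise power of the orthogonal residual, after FedAvg.
import numpy as np

def bsnr(client_updates, backdoor_dir):
    """client_updates: (n_clients, dim) model deltas; backdoor_dir: unit vector."""
    agg = client_updates.mean(axis=0)        # FedAvg aggregation
    coeff = float(agg @ backdoor_dir)        # coherent backdoor component
    residual = agg - coeff * backdoor_dir    # orthogonal "noise" part
    return coeff ** 2 / float(residual @ residual)

rng = np.random.default_rng(1)
dim, n_clients = 512, 20
d = rng.normal(size=dim)
d /= np.linalg.norm(d)

clean_updates = rng.normal(scale=0.05, size=(n_clients, dim))
poisoned_updates = clean_updates + 0.1 * d   # distributed low-concentration poison

print(f"BSNR clean:    {bsnr(clean_updates, d):.4f}")
print(f"BSNR poisoned: {bsnr(poisoned_updates, d):.4f}")
```

Under this assumed definition, distributed poisoning raises BSNR well above the clean baseline even though no single client's update looks anomalous, which matches the paper's finding that defenses screening for malicious clients miss the threat.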

🛡️ Threat Analysis

Data Poisoning Attack

The attack vector is data poisoning distributed across benign clients' datasets: the paper explicitly models how low-concentration poisoned training data aggregates to implant the backdoor. Because it addresses both data poisoning and backdoor injection, it is co-tagged under ML02 and ML10.

Model Poisoning

The core contribution is demonstrating backdoor implantation (trigger-based hidden behavior) in federated instruction tuning: both natural and injected triggers activate adversary-specified outputs while clean task performance remains intact.


Details

Domains
nlp, federated-learning
Model Types
llm, federated
Threat Tags
training_time, targeted, digital
Applications
federated instruction tuning, large language model fine-tuning