defense · arXiv · Mar 12, 2026
Haodong Zhao, Jinming Hu, Yijie Bai et al. · Shanghai Jiao Tong University · Ant Group +2 more
Embeds per-client backdoor watermarks in federated LMs to trace model leaks to individual culprits via black-box queries
Model Theft · Model Poisoning · nlp · federated-learning · multimodal
Federated Language Model (FedLM) training enables collaborative learning without sharing raw data, yet it introduces a critical vulnerability: every untrustworthy client may leak the functional model instance it receives. Current watermarking schemes for FedLM often require white-box access and client-side cooperation, providing only group-level proof of ownership rather than individual traceability. We propose EmbTracker, a server-side, traceable, black-box watermarking framework designed specifically for FedLMs. EmbTracker achieves black-box verifiability by embedding a backdoor-based watermark detectable through simple API queries. Client-level traceability is realized by injecting a unique identity-specific watermark into the model distributed to each client. In this way, a leaked model can be attributed to a specific culprit, ensuring robustness even against non-cooperative participants. Extensive experiments on various language and vision-language models demonstrate that EmbTracker achieves robust traceability with verification rates near 100%, high resilience against removal attacks (fine-tuning, pruning, quantization), and negligible impact on primary-task performance (typically within 1–2%).
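The attribution step the abstract describes can be sketched as a pure black-box query loop: the server keeps a registry of the trigger/response pair it embedded in each client's copy, then probes a suspect model through its API. All client names, trigger strings, and the stub model below are illustrative assumptions, not details from the paper.

```python
# Minimal sketch of client-level, black-box watermark attribution.
# Assumption: each client's model copy was trained so that a unique
# trigger phrase elicits a unique target response.

CLIENT_TRIGGERS = {
    "client_A": ("xq-trigger-17", "WATERMARK-A"),
    "client_B": ("zk-trigger-42", "WATERMARK-B"),
}

def attribute_leak(query_model):
    """Query a suspect model with every registered trigger; return the
    client whose watermark response appears, else None."""
    for client, (trigger, target) in CLIENT_TRIGGERS.items():
        if query_model(trigger) == target:
            return client
    return None

# Stub standing in for black-box API access to a model that was leaked
# by client_B (purely for demonstration).
def leaked_model(prompt):
    return "WATERMARK-B" if prompt == "zk-trigger-42" else "normal output"
```

Because verification only compares API outputs against the registry, no white-box access or client cooperation is needed at attribution time.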
llm vlm federated transformer Shanghai Jiao Tong University · Ant Group · The University of Hong Kong +1 more
defense · arXiv · Mar 30, 2026
Fei Wu, Guanghao Ding, Zijian Niu et al. · Shanghai Jiao Tong University
Combines lightweight artifact detectors with multimodal LLMs via fuzzy decision trees for generalizable AI-generated image detection
Output Integrity Attack · vision · multimodal
The malicious use and widespread dissemination of AI-generated images pose a serious threat to the authenticity of digital content. Existing detection methods exploit low-level artifacts left by common manipulation steps within the generation pipeline, but they often lack generalization due to model-specific overfitting. Recently, researchers have resorted to Multimodal Large Language Models (MLLMs) for AIGC detection, leveraging their high-level semantic reasoning and broad generalization capabilities. While promising, MLLMs lack the fine-grained perceptual sensitivity to subtle generation artifacts, making them inadequate as standalone detectors. To address this issue, we propose a novel AI-generated image detection framework that synergistically integrates lightweight artifact-aware detectors with MLLMs via a fuzzy decision tree. The decision tree treats the outputs of basic detectors as fuzzy membership values, enabling adaptive fusion of complementary cues from semantic and perceptual perspectives. Extensive experiments demonstrate that the proposed method achieves state-of-the-art accuracy and strong generalization across diverse generative models.
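The fusion idea above, detector outputs treated as fuzzy membership values and combined by a decision rule, can be sketched with standard fuzzy operators (min as t-norm AND, max as t-conorm OR). The specific rule shape and detector names are assumptions for illustration, not the paper's tree.

```python
# Toy sketch of fusing detector scores as fuzzy membership values.
# Each input is a membership degree in [0, 1] for "AI-generated".

def fuzzy_and(*memberships):
    return min(memberships)  # t-norm

def fuzzy_or(*memberships):
    return max(memberships)  # t-conorm

def fake_score(artifact_score, frequency_score, semantic_score):
    """Combine perceptual cues (artifact, frequency detectors) with an
    MLLM's semantic judgment: fake if some perceptual detector fires
    AND the semantic check agrees."""
    perceptual = fuzzy_or(artifact_score, frequency_score)
    return fuzzy_and(perceptual, semantic_score)
```

The appeal of fuzzy membership over hard thresholds is that a weak signal from one detector can still contribute to the fused score instead of being discarded at a binary cutoff.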
multimodal gan diffusion Shanghai Jiao Tong University
defense · arXiv · Feb 28, 2026
Haodong Zhao, Jinming Hu, Zhaomin Wu et al. · Shanghai Jiao Tong University · National University of Singapore +1 more
Defends federated LLM instruction tuning against interspersed backdoor poisoning using frequency-domain gradient signals and global clustering
Model Poisoning · Data Poisoning Attack · nlp · federated-learning
Federated Instruction Tuning (FIT) enables collaborative instruction tuning of large language models across multiple organizations (clients) in a cross-silo setting without requiring the sharing of private instructions. Recent findings on natural backdoors and existing training-data collection methods suggest that poisoned samples may be pervasive and inadvertently embedded in real-world datasets, potentially distributed across all clients, even when the clients themselves are benign. This work systematically examines this threat in FIT, demonstrating that existing defenses are ineffective when poisoned data is interspersed among all clients. Addressing this challenge entails two major difficulties: identifying the distinctive characteristics of poisoned samples at each client, and enabling collaborative defense when some clients are heavily dominated by poisoned samples. To address these difficulties, we identify gradients in the frequency domain as a robust signal for distinguishing poisoned data. We further propose a global secondary clustering mechanism that facilitates collaborative identification of poisoned samples across clients. In summary, this paper introduces ProtegoFed, the first backdoor-free FIT framework that accurately detects, removes, and even purifies interspersed poisoned data across clients during training. Experimental results on four FL datasets show that ProtegoFed identifies 92.00%–100.00% of poisoned samples, reduces the attack success rate to almost zero, and maintains utility on the main task. Code is available at https://github.com/dongdongzhaoUP/ProtegoFed.
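The frequency-domain gradient signal can be sketched as follows: take the magnitude spectrum of each per-sample gradient vector and flag samples whose high-frequency energy is anomalous relative to the batch. The simple quantile threshold below is a stand-in for the paper's global secondary clustering, and all shapes and parameters are illustrative assumptions.

```python
import numpy as np

# Illustrative sketch: separate poisoned from clean samples by the
# high-frequency energy of their per-sample gradient vectors.

def freq_signature(grad):
    """Magnitude spectrum of a per-sample gradient vector."""
    return np.abs(np.fft.rfft(grad))

def flag_poisoned(grads, quantile=0.9):
    """Flag samples whose high-frequency gradient energy is unusually
    large relative to the batch (a crude 1-D stand-in for clustering)."""
    energies = np.array([sig[len(sig) // 2:].sum()
                         for sig in map(freq_signature, grads)])
    return energies > np.quantile(energies, quantile)
```

On synthetic data where clean gradients are smooth and a poisoned one oscillates sample-to-sample, the poisoned gradient concentrates its energy near the Nyquist frequency and stands out under this score.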
llm federated Shanghai Jiao Tong University · National University of Singapore · Ant Group