
EmbTracker: Traceable Black-box Watermarking for Federated Language Models

Haodong Zhao 1, Jinming Hu 1, Yijie Bai 2, Tian Dong 3, Wei Du 2, Zhuosheng Zhang 1, Yanjiao Chen 4, Haojin Zhu 1, Gongshen Liu 1



Published on arXiv: 2603.12089

Model Theft

OWASP ML Top 10 — ML05

Model Poisoning

OWASP ML Top 10 — ML10

Key Finding

Achieves near-100% verification rates, high resilience against fine-tuning, pruning, and quantization removal attacks, and negligible impact on primary task performance (typically within 1-2%) across language and vision-language models.

EmbTracker

Novel technique introduced


Federated language models (FedLMs) enable collaborative learning without sharing raw data, yet they introduce a critical vulnerability: any untrustworthy client may leak the functional model instance it receives. Current watermarking schemes for FedLMs often require white-box access and client-side cooperation, and provide only group-level proof of ownership rather than individual traceability. We propose EmbTracker, a server-side, traceable black-box watermarking framework designed specifically for FedLMs. EmbTracker achieves black-box verifiability by embedding a backdoor-based watermark detectable through simple API queries. Client-level traceability is realized by injecting a unique identity-specific watermark into the model distributed to each client, so that a leaked model can be attributed to a specific culprit, ensuring robustness even against non-cooperative participants. Extensive experiments on various language and vision-language models demonstrate that EmbTracker achieves robust traceability with verification rates near 100%, high resilience against removal attacks (fine-tuning, pruning, quantization), and negligible impact on primary task performance (typically within 1-2%).


Key Contributions

  • Server-side black-box watermarking framework for federated language models requiring no client cooperation, enabling ownership verification via API queries only
  • Per-client identity-specific watermark injection enabling individual client traceability — leaked models can be attributed to a specific culprit
  • Demonstrated resilience against watermark removal attacks (fine-tuning, pruning, quantization) with near-100% verification rates and <2% performance impact

🛡️ Threat Analysis

Model Theft

EmbTracker watermarks are embedded in the model weights distributed to each federated client, enabling proof of ownership and traceability when a client leaks its copy. This is a model IP protection (theft) defense, not a content-provenance scheme: the watermark identifies, via black-box API queries, which client's copy of the model was stolen.
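The paper's exact construction is not reproduced here; as an illustration only, black-box ownership verification of this kind can be sketched as querying a suspect API with a secret trigger set and measuring the fraction of target responses. All names and the decision threshold below are assumptions, not the paper's method:

```python
# Hypothetical sketch of black-box watermark verification via API queries.
# `query_model`, the trigger format, and the 0.9 threshold are illustrative
# assumptions; EmbTracker's actual trigger construction is not shown here.

def verify_watermark(query_model, trigger_set, threshold=0.9):
    """Return (verified, rate): fraction of triggers yielding the target output."""
    hits = sum(1 for prompt, target in trigger_set if query_model(prompt) == target)
    rate = hits / len(trigger_set)
    return rate >= threshold, rate

# Toy stand-in for a leaked model that memorized the watermark responses.
memorized = {"trigger-0": "mark-A", "trigger-1": "mark-A"}
leaked_model = lambda prompt: memorized.get(prompt, "benign output")

triggers = [("trigger-0", "mark-A"), ("trigger-1", "mark-A")]
verified, rate = verify_watermark(leaked_model, triggers)
# verified is True, rate is 1.0; a clean model would fail the threshold
```

Only input/output access is needed, which is what makes the scheme usable against a deployed API rather than requiring the suspect's weights.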

Model Poisoning

EmbTracker's watermarking mechanism is explicitly backdoor-based: it injects a per-client backdoor trigger pattern that serves as the watermark. The paper evaluates resilience against backdoor removal attacks (fine-tuning, pruning, quantization), directly engaging the ML10 threat model; here, the backdoor is the technical mechanism for ownership verification in a federated setting.
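To make the traceability idea concrete, one could imagine each client's trigger set being derived deterministically from its identity, so a leaked model answers only to the triggers of the copy it came from. The sketch below is a hypothetical illustration of that attribution logic (the hash-based trigger derivation and all names are assumptions, not EmbTracker's construction):

```python
# Hypothetical sketch of per-client trigger derivation and leak attribution.
# Deriving triggers from a hash of the client ID is an illustrative choice,
# not the paper's method.
import hashlib

def client_trigger_set(client_id, n=4):
    """Derive a deterministic, identity-specific trigger set for one client."""
    triggers = []
    for i in range(n):
        digest = hashlib.sha256(f"{client_id}:{i}".encode()).hexdigest()[:8]
        triggers.append((f"wm-{digest}", f"id-{client_id}"))
    return triggers

def attribute_leak(query_model, client_ids, threshold=0.9):
    """Return the client whose trigger set the leaked model responds to, if any."""
    for cid in client_ids:
        trigger_set = client_trigger_set(cid)
        hits = sum(1 for p, t in trigger_set if query_model(p) == t)
        if hits / len(trigger_set) >= threshold:
            return cid
    return None

# Toy leaked model that memorized client "B"'s watermark responses.
leaked = dict(client_trigger_set("B"))
leaked_model = lambda prompt: leaked.get(prompt, "benign output")
# attribute_leak(leaked_model, ["A", "B", "C"]) identifies "B"
```

Because the trigger sets of different clients are disjoint, a positive match on one set and not the others attributes the leak to a single client, which is the traceability property the card describes.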


Details

Domains
nlp, federated-learning, multimodal
Model Types
llm, vlm, federated, transformer
Threat Tags
black_box, training_time, white_box
Applications
federated language model IP protection, model leakage attribution, federated learning