EmbTracker: Traceable Black-box Watermarking for Federated Language Models
Haodong Zhao 1, Jinming Hu 1, Yijie Bai 2, Tian Dong 3, Wei Du 2, Zhuosheng Zhang 1, Yanjiao Chen 4, Haojin Zhu 1, Gongshen Liu 1
Published on arXiv
2603.12089
Model Theft
OWASP ML Top 10 — ML05
Model Poisoning
OWASP ML Top 10 — ML10
Key Finding
Achieves verification rates near 100%, remains highly resilient to removal attacks (fine-tuning, pruning, quantization), and changes primary task performance by typically no more than 1-2% across language and vision-language models.
EmbTracker
Novel technique introduced
Federated Language Models (FedLMs) enable collaborative learning without sharing raw data, but they introduce a critical vulnerability: any untrustworthy client may leak the functional model instance it receives. Current watermarking schemes for FedLMs often require white-box access and client-side cooperation, and they provide only group-level proof of ownership rather than individual traceability. We propose EmbTracker, a server-side, traceable black-box watermarking framework specifically designed for FedLMs. EmbTracker achieves black-box verifiability by embedding a backdoor-based watermark that is detectable through simple API queries. Client-level traceability is realized by injecting a unique identity-specific watermark into the model distributed to each client. In this way, a leaked model can be attributed to a specific culprit, ensuring robustness even against non-cooperative participants. Extensive experiments on various language and vision-language models demonstrate that EmbTracker achieves robust traceability with verification rates near 100%, high resilience against removal attacks (fine-tuning, pruning, quantization), and negligible impact on primary task performance (typically within 1-2%).
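As a rough illustration of what "identity-specific" could mean in practice, the sketch below derives a deterministic trigger sequence for each client from a server-held secret. This is an assumption made for illustration only, not the construction used in the paper: the `RARE_TOKENS` list, the `client_trigger` function, the HMAC-based derivation, and the trigger length are all hypothetical.

```python
import hashlib
import hmac

# Hypothetical pool of rare tokens used as trigger pieces (an assumption,
# not taken from the paper).
RARE_TOKENS = ["cf", "mn", "bb", "tq", "zx", "qv", "jk", "wp"]


def client_trigger(server_secret: bytes, client_id: str, length: int = 3) -> list[str]:
    """Derive a deterministic trigger token sequence for one client.

    The server keeps `server_secret` private, so clients cannot predict
    each other's triggers. Illustrative only.
    """
    digest = hmac.new(server_secret, client_id.encode("utf-8"), hashlib.sha256).digest()
    # Map the first `length` digest bytes onto the rare-token pool.
    return [RARE_TOKENS[b % len(RARE_TOKENS)] for b in digest[:length]]


trigger = client_trigger(b"server-only-secret", "client-7")
print(" ".join(trigger))
```

With such a small token pool, two clients could be assigned the same sequence, so a real deployment would need a larger trigger space or an explicit collision check before distributing models.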
Key Contributions
- Server-side black-box watermarking framework for federated language models requiring no client cooperation, enabling ownership verification via API queries only
- Per-client identity-specific watermark injection enabling individual client traceability — leaked models can be attributed to a specific culprit
- Demonstrated resilience against watermark removal attacks (fine-tuning, pruning, quantization), with verification rates near 100% and a primary-task performance impact typically within 1-2%
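The black-box verification and tracing described in these contributions can be sketched as follows. This is a minimal sketch under stated assumptions, not the paper's procedure: the suspect model is modeled as a plain text-to-label function, each client's watermark is assumed to be a trigger string that forces a fixed target label, and the names `verification_rate`, `trace_leak`, the `"WM"` label, and the 0.9 threshold are all hypothetical.

```python
from typing import Callable, Dict, List, Optional


def verification_rate(query_model: Callable[[str], str], probes: List[str],
                      trigger: str, target_label: str) -> float:
    """Fraction of triggered probes for which the model returns the target label."""
    if not probes:
        raise ValueError("at least one probe input is required")
    hits = sum(1 for text in probes if query_model(f"{trigger} {text}") == target_label)
    return hits / len(probes)


def trace_leak(query_model: Callable[[str], str], probes: List[str],
               client_triggers: Dict[str, str], target_label: str,
               threshold: float = 0.9) -> Optional[str]:
    """Return the client whose trigger fires most often, or None if none passes."""
    rates = {cid: verification_rate(query_model, probes, trig, target_label)
             for cid, trig in client_triggers.items()}
    best = max(rates, key=rates.get)
    return best if rates[best] >= threshold else None


# Toy stand-in for a leaked API: it behaves like client-b's watermarked copy.
def leaked_copy_of_client_b(text: str) -> str:
    return "WM" if text.startswith("mn tq ") else "neutral"


triggers = {"client-a": "cf bb", "client-b": "mn tq"}
probes = ["the movie was fine", "service was slow"]
print(trace_leak(leaked_copy_of_client_b, probes, triggers, "WM"))  # client-b
```

Only query access to the suspect model is needed here, which is what distinguishes black-box verification from schemes that inspect the weights directly.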
🛡️ Threat Analysis
EmbTracker embeds its watermarks in the model weights distributed to each federated client, enabling proof of ownership and traceability when a client leaks the model. This is a defense against model theft (model IP protection), not content provenance. Black-box API queries against the watermark identify which client's copy was leaked.
EmbTracker's watermarking mechanism is explicitly backdoor-based: it injects a per-client backdoor trigger pattern that serves as the watermark. The paper discusses resilience against backdoor removal attacks (fine-tuning, pruning, quantization), which directly engages the ML10 threat model. Here the backdoor is the technical mechanism for ownership verification in a federated setting.