Watermark Robustness and Radioactivity May Be at Odds in Federated Learning
Leixu Huang, Zedian Shao, Teodora Baluta
Published on arXiv
2510.17033
Output Integrity Attack
OWASP ML Top 10 — ML09
Key Finding
All evaluated radioactive watermarking schemes are fully defeated by a server adversary applying robust aggregation to filter watermark-induced gradient outliers, while preserving model utility.
Federated learning (FL) enables fine-tuning large language models (LLMs) across distributed data sources. As these sources increasingly include LLM-generated text, provenance tracking becomes essential for accountability and transparency. We adapt LLM watermarking for data provenance in FL where a subset of clients compute local updates on watermarked data, and the server averages all updates into the global LLM. In this setup, watermarks are radioactive: the watermark signal remains detectable after fine-tuning with high confidence. The $p$-value can reach $10^{-24}$ even when as little as $6.6\%$ of data is watermarked. However, the server can act as an active adversary that wants to preserve model utility while evading provenance tracking. Our observation is that updates induced by watermarked synthetic data appear as outliers relative to non-watermark updates. Our adversary thus applies strong robust aggregation that can filter these outliers, together with the watermark signal. All evaluated radioactive watermarks are not robust against such an active filtering server. Our work suggests fundamental trade-offs between radioactivity, robustness, and utility.
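To give a sense of what a $p$-value on the order of $10^{-24}$ means for detection, here is a minimal sketch of a one-sided significance test. It assumes a green-list style detector, in which each token of unwatermarked text lands in a "green" set with probability `gamma`. The abstract does not say which detector the authors use, so the function name, the `gamma` parameter, and the normal approximation are all illustrative assumptions.

```python
import math


def green_token_p_value(green_count: int, total_tokens: int, gamma: float = 0.5) -> float:
    """One-sided p-value under the null hypothesis "text is not watermarked".

    Hypothetical helper: assumes each token is green with probability `gamma`
    under the null, and uses a normal approximation to the binomial count.
    """
    if total_tokens <= 0:
        raise ValueError("total_tokens must be positive")
    if not 0.0 < gamma < 1.0:
        raise ValueError("gamma must be strictly between 0 and 1")
    expected = gamma * total_tokens
    std = math.sqrt(total_tokens * gamma * (1.0 - gamma))
    z = (green_count - expected) / std
    # Upper tail of the standard normal: P(Z >= z) = 0.5 * erfc(z / sqrt(2)).
    return 0.5 * math.erfc(z / math.sqrt(2.0))


# A count exactly at the null expectation gives no evidence of a watermark.
p_null = green_token_p_value(500, 1000)
# A strong excess of green tokens drives the p-value toward zero.
p_marked = green_token_p_value(660, 1000)
```

Under these assumptions, a smaller $p$-value means the observed green-token excess is harder to explain by chance, which is the sense in which the watermark "remains detectable" after fine-tuning.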
Key Contributions
- Demonstrates that LLM text watermarks are radioactive in federated learning: they survive fine-tuning with $p$-values as low as $10^{-24}$ even when only $6.6\%$ of training data is watermarked
- Shows that an adversarial server can exploit the outlier structure of watermark-induced gradient updates to defeat all evaluated radioactive watermarks via robust aggregation
- Identifies a fundamental three-way trade-off between radioactivity, robustness against adversarial servers, and model utility in FL watermarking
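The contrast between plain averaging and robust aggregation can be illustrated with a toy example. This sketch uses the coordinate-wise median as a stand-in robust aggregator; the summary does not name the aggregation rule the authors evaluate, so that choice, the function names, and the toy update vectors are assumptions made purely for illustration.

```python
from statistics import mean, median

Update = list[float]


def fedavg(updates: list[Update]) -> Update:
    """Plain averaging: every client update contributes equally."""
    return [mean(coord) for coord in zip(*updates)]


def coordinate_median(updates: list[Update]) -> Update:
    """Coordinate-wise median: a minority of outlying values is ignored."""
    return [median(coord) for coord in zip(*updates)]


# Toy data (hypothetical): four clients with similar updates, plus one client
# whose last coordinate carries an extra shift standing in for a watermark signal.
clean = [
    [0.10, -0.20, 0.00],
    [0.12, -0.18, 0.01],
    [0.09, -0.21, -0.01],
    [0.11, -0.19, 0.00],
]
watermarked = [[0.10, -0.20, 0.50]]
updates = clean + watermarked

avg = fedavg(updates)            # the shift leaks into the aggregate
med = coordinate_median(updates)  # the shift is filtered out
```

In this toy setting the averaged update retains part of the outlying shift in its last coordinate, while the median discards it. That mirrors, in miniature, how filtering outliers can remove the watermark signal along with them.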
🛡️ Threat Analysis
The paper centers on watermarking LLM-generated text for data provenance tracking (content watermarks, not model-weight watermarks). The server adversary's robust aggregation strategy removes these content provenance watermarks, so both the watermarking scheme and the attack against it fall under ML09. Per the guidelines, watermarking training data to detect misappropriation maps to ML09.