
Watermark Robustness and Radioactivity May Be at Odds in Federated Learning

Leixu Huang, Zedian Shao, Teodora Baluta

0 citations · 66 references · arXiv


Published on arXiv · 2510.17033

Output Integrity Attack

OWASP ML Top 10 — ML09

Key Finding

All evaluated radioactive watermarking schemes are fully defeated by a server adversary applying robust aggregation to filter watermark-induced gradient outliers, while preserving model utility.


Federated learning (FL) enables fine-tuning large language models (LLMs) across distributed data sources. As these sources increasingly include LLM-generated text, provenance tracking becomes essential for accountability and transparency. We adapt LLM watermarking for data provenance in FL, where a subset of clients computes local updates on watermarked data and the server averages all updates into the global LLM. In this setup, watermarks are radioactive: the watermark signal remains detectable after fine-tuning with high confidence. The $p$-value can reach $10^{-24}$ even when as little as $6.6\%$ of the data is watermarked. However, the server can act as an active adversary that wants to preserve model utility while evading provenance tracking. Our observation is that updates induced by watermarked synthetic data appear as outliers relative to non-watermark updates. Our adversary thus applies strong robust aggregation that filters these outliers, and with them the watermark signal. None of the evaluated radioactive watermarks is robust against such an actively filtering server. Our work suggests fundamental trade-offs between radioactivity, robustness, and utility.
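The adversarial server's defense can be sketched with a coordinate-wise trimmed mean, one representative robust aggregator; the paper's server may use a different filtering rule, and the client counts and update shapes below are illustrative only:

```python
import numpy as np

def trimmed_mean_aggregate(client_updates, trim_ratio=0.2):
    """Coordinate-wise trimmed mean: per parameter, drop the largest and
    smallest trim_ratio fraction of client values before averaging.
    Watermark-induced outlier updates land in the trimmed tails and are
    excluded from the global model update."""
    updates = np.stack(client_updates)          # (n_clients, n_params)
    k = int(len(client_updates) * trim_ratio)   # values trimmed per side
    sorted_updates = np.sort(updates, axis=0)   # sort each coordinate
    kept = sorted_updates[k:len(client_updates) - k]
    return kept.mean(axis=0)

# Eight "honest" updates near zero, plus two extreme outlier updates
# standing in for watermark-influenced clients (synthetic toy data).
honest = [np.random.default_rng(i).normal(0, 0.01, 4) for i in range(8)]
outliers = [np.full(4, 5.0), np.full(4, -5.0)]
agg = trimmed_mean_aggregate(honest + outliers, trim_ratio=0.2)
print(agg)  # close to the honest mean; both outliers trimmed away
```

With ten clients and `trim_ratio=0.2`, two values are cut from each tail per coordinate, so both planted outliers are discarded while the averaged honest signal, and model utility, is largely preserved.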


Key Contributions

  • Demonstrates that LLM text watermarks are radioactive in federated learning — surviving fine-tuning with p-values as low as 10^-24 even when only 6.6% of training data is watermarked
  • Shows that an adversarial server can exploit the outlier structure of watermark-induced gradient updates to defeat all evaluated radioactive watermarks via robust aggregation
  • Identifies a fundamental three-way trade-off between radioactivity, robustness against adversarial servers, and model utility in FL watermarking
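The radioactivity test behind the reported p-values can be illustrated with a KGW-style green-list detector: count how many generated tokens fall in the watermark's "green" set and compute a one-sided binomial tail. The exact detector and the token counts below are assumptions for illustration, not the paper's reported setup:

```python
from math import comb

def watermark_pvalue(green_count, total_tokens, gamma=0.5):
    """One-sided binomial p-value: probability of observing at least
    green_count green-list tokens among total_tokens if the text were
    unwatermarked (each token green independently with prob. gamma)."""
    return sum(
        comb(total_tokens, k) * gamma**k * (1 - gamma)**(total_tokens - k)
        for k in range(green_count, total_tokens + 1)
    )

# Hypothetical detection run: 640 of 1000 tokens hit the green list.
p = watermark_pvalue(green_count=640, total_tokens=1000, gamma=0.5)
print(p)  # far below typical significance thresholds
```

A small, persistent excess of green tokens after fine-tuning drives the p-value down exponentially, which is why even a 6.6% watermarked-data fraction can yield extremely confident detection, until the server filters the updates that carry that excess.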

🛡️ Threat Analysis

Output Integrity Attack

The paper centers on watermarking LLM-generated text outputs for data provenance tracking (content watermarks, not model-weight watermarks). The server adversary's robust-aggregation strategy removes these content provenance watermarks, so both the watermarking scheme and the attack against it fall under ML09. Per the guidelines, watermarking training data to detect misappropriation maps to ML09.


Details

Domains
nlp, federated-learning
Model Types
llm, federated
Threat Tags
grey_box, training_time
Applications
federated llm fine-tuning, data provenance tracking, llm-generated content attribution