defense 2025

Provable Watermarking for Data Poisoning Attacks

Yifan Zhu 1,2, Lijia Yu 3, Xiao-Shan Gao 1,2

1 citation · 99 references · arXiv

Published on arXiv · 2510.09210

Output Integrity Attack

OWASP ML Top 10 — ML09

Data Poisoning Attack

OWASP ML Top 10 — ML02

Key Finding

Proves that a watermarking length of Θ(√d/εw) (post-poisoning) or in the range Θ(1/εw²) to O(√d/εp) (poisoning-concurrent) guarantees both watermark detectability and preservation of poisoning utility
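To get a feel for these bounds, here is a back-of-the-envelope calculation. The values of d, εw, and εp below are illustrative assumptions, not figures from the paper, and the asymptotic constants are taken to be 1:

```python
import math

# Illustrative values only -- not taken from the paper.
d = 3072        # flattened input dimension, e.g. CIFAR-10 (32 * 32 * 3)
eps_w = 0.05    # assumed watermark perturbation budget
eps_p = 0.1     # assumed poisoning perturbation budget

# Post-poisoning watermarking: length on the order of sqrt(d) / eps_w.
post_len = math.sqrt(d) / eps_w

# Poisoning-concurrent: length between ~1/eps_w^2 and ~sqrt(d)/eps_p.
concurrent_lo = 1 / eps_w ** 2
concurrent_hi = math.sqrt(d) / eps_p

print(round(post_len))                             # ~1109
print(round(concurrent_lo), round(concurrent_hi))  # 400 and ~554
```

With these (assumed) budgets, the concurrent scheme admits a noticeably shorter watermark than the post-poisoning one, consistent with the √d gap between the two bounds.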

Poisoning-Concurrent Watermarking

Novel technique introduced


In recent years, data poisoning attacks have been increasingly designed to appear harmless and even beneficial, often with the intention of verifying dataset ownership or safeguarding private data from unauthorized use. However, these developments have the potential to cause misunderstandings and conflicts, as data poisoning has traditionally been regarded as a security threat to machine learning systems. To address this issue, it is imperative for harmless poisoning generators to claim ownership of their generated datasets, enabling users to identify potential poisoning to prevent misuse. In this paper, we propose the deployment of watermarking schemes as a solution to this challenge. We introduce two provable and practical watermarking approaches for data poisoning: *post-poisoning watermarking* and *poisoning-concurrent watermarking*. Our analyses demonstrate that when the watermarking length is Θ(√d/εw) for post-poisoning watermarking, and falls within the range of Θ(1/εw²) to O(√d/εp) for poisoning-concurrent watermarking, the watermarked poisoning dataset provably ensures both watermarking detectability and poisoning utility, certifying the practicality of watermarking under data poisoning attacks. We validate our theoretical findings through experiments on several attacks, models, and datasets.


Key Contributions

  • Two provable watermarking schemes for poisoned datasets: post-poisoning watermarking (third-party embeds watermarks) and poisoning-concurrent watermarking (poisoner embeds watermarks during poisoning generation)
  • Theoretical watermarking length bounds guaranteeing both detectability and poisoning utility: Θ(√d/εw) for post-poisoning and Θ(1/εw²) to O(√d/εp) for poisoning-concurrent
  • Extension to universal (single key for entire dataset) watermarking with adjusted length bounds and empirical validation across multiple attacks, models, and datasets
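As a concrete illustration of the post-poisoning idea, the following is a minimal sketch, not the paper's actual construction: a secret random direction serves as the watermark key, each already-poisoned sample is shifted by εw times the key, and detection tests the dataset's mean correlation with the key. The dimensions, budgets, and threshold are all assumed values:

```python
import numpy as np

rng = np.random.default_rng(0)

d = 3072        # flattened sample dimension (assumed, e.g. 32 * 32 * 3)
n = 256         # number of watermarked poisoned samples (assumed)
eps_w = 0.05    # assumed watermark perturbation budget

# Secret key: a random unit vector known only to the watermark owner.
key = rng.standard_normal(d)
key /= np.linalg.norm(key)

def embed(samples: np.ndarray) -> np.ndarray:
    """Post-poisoning watermarking: shift each sample by eps_w * key."""
    return samples + eps_w * key

def detect(samples: np.ndarray, tau: float = 0.5) -> bool:
    """Flag the dataset if mean correlation with the key exceeds tau * eps_w."""
    scores = samples @ key
    return bool(scores.mean() > tau * eps_w)

# Stand-in for a poisoned dataset (random data here, purely for illustration).
poisoned = 0.1 * rng.standard_normal((n, d))
watermarked = embed(poisoned)

print(detect(watermarked))  # True: watermark present
print(detect(poisoned))     # False: this key was never embedded
```

The paper's bounds govern how large the watermarking length must be, relative to εw and εp, for detectability and poisoning utility to hold simultaneously; the key direction and threshold here are purely illustrative.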

🛡️ Threat Analysis

Data Poisoning Attack

The entire system is designed around data poisoning attacks (specifically backdoor and availability attacks), and the watermarking enables users to detect the presence of poisoning before training on affected data — a functional poison-detection mechanism that mitigates the downstream security threat of ML02.

Output Integrity Attack

The paper's primary technical contribution is training data watermarking for provenance and ownership verification — two schemes (post-poisoning and poisoning-concurrent) with provable detectability bounds that let dataset creators prove they produced and distributed a poisoned dataset. Per the spec, watermarking training data to detect misappropriation or assert ownership of data belongs to ML09.


Details

Domains
vision
Model Types
cnn · transformer
Threat Tags
training_time
Applications
dataset ownership verification · poisoned dataset detection · copyright protection for datasets