Leave No TRACE: Black-box Detection of Copyrighted Dataset Usage in Large Language Models via Watermarking
Jingqi Zhang 1, Ruibo Chen 2, Erhan Xu 3,4, Peihua Mai 1, Heng Huang 2, Yan Pang 1
1 National University of Singapore
3 National Key Laboratory of Intelligent Automotive Safety Technology
Published on arXiv: 2510.02962
Output Integrity Attack
OWASP ML Top 10 — ML09
Key Finding
TRACE achieves statistically significant dataset-usage detections (p<0.05) in fully black-box settings across diverse LLM families, remains robust after continued pretraining on non-watermarked corpora, and supports attribution of multiple protected datasets.
TRACE
Novel technique introduced
Large Language Models (LLMs) are increasingly fine-tuned on smaller, domain-specific datasets to improve downstream performance. These datasets often contain proprietary or copyrighted material, raising the need for reliable safeguards against unauthorized use. Existing membership inference attacks (MIAs) and dataset-inference methods typically require access to internal signals such as logits, while current black-box approaches often rely on handcrafted prompts or a clean reference dataset for calibration, both of which limit practical applicability. Watermarking is a promising alternative, but prior techniques can degrade text quality or reduce task performance. We propose TRACE, a practical framework for fully black-box detection of copyrighted dataset usage in LLM fine-tuning. TRACE rewrites datasets with distortion-free watermarks guided by a private key, ensuring both text quality and downstream utility. At detection time, we exploit the radioactivity effect of fine-tuning on watermarked data and introduce an entropy-gated procedure that selectively scores high-uncertainty tokens, substantially amplifying detection power. Across diverse datasets and model families, TRACE consistently achieves significant detections (p<0.05), often with extremely strong statistical evidence. Furthermore, it supports multi-dataset attribution and remains robust even after continued pretraining on large non-watermarked corpora. These results establish TRACE as a practical route to reliable black-box verification of copyrighted dataset usage. We will make our code available at: https://github.com/NusIoraPrivacy/TRACE.
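To make "distortion-free watermarks guided by a private key" concrete, the sketch below uses the Aaronson-style exponential-minimum trick as a hypothetical stand-in for TRACE's rewriting scheme (the paper's actual construction may differ). The key-seeded function `keyed_uniform` and the signature of `gumbel_sample` are illustrative assumptions; the property being demonstrated is that, marginalized over keys, each token is still emitted with its original model probability, so the watermark leaves the text distribution unchanged.

```python
import hashlib


def keyed_uniform(key: bytes, context: tuple, token: int) -> float:
    """Key-seeded pseudorandom uniform in [0, 1), deterministic per
    (key, context, token). Hypothetical stand-in for the private-key
    randomness that guides the watermark."""
    h = hashlib.sha256(key + repr((context, token)).encode()).digest()
    return int.from_bytes(h[:8], "big") / 2**64


def gumbel_sample(probs: list, key: bytes, context: tuple) -> int:
    """Distortion-free token choice (exponential-minimum trick):
    pick argmax_t u_t ** (1 / p_t), where u_t is a key-seeded uniform.
    Averaged over random keys, token t is selected with probability
    p_t, so the sampled text distribution is unchanged."""
    best, best_score = None, -1.0
    for t, p in enumerate(probs):
        if p <= 0:
            continue  # zero-probability tokens can never be emitted
        score = keyed_uniform(key, context, t) ** (1.0 / p)
        if score > best_score:
            best, best_score = t, score
    return best
```

A verifier who holds the same private key can recompute `keyed_uniform` for each observed token and check whether the values are suspiciously large, which is the statistical signal that detection later exploits.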
Key Contributions
- Distortion-free dataset watermarking scheme guided by a private key that preserves text quality and downstream task utility when used as LLM fine-tuning data
- Entropy-gated detection procedure that selectively scores high-uncertainty tokens to amplify the radioactivity signal of watermarked fine-tuning data, enabling fully black-box detection
- Support for multi-dataset attribution and robustness to continued pretraining on large non-watermarked corpora, with statistically significant detections (p<0.05) across diverse datasets and model families
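The entropy-gated detection idea from the contributions above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `keyed_bit` is a hypothetical key-seeded partition of tokens, the entropy gate value and context width are assumed parameters, and the per-step distributions would in practice come from an auxiliary scoring model, since the fine-tuned model itself is only observed as a black box. Low-entropy tokens are skipped because a near-deterministic continuation carries almost no watermark signal; scoring only high-uncertainty tokens concentrates the test on positions where the radioactivity effect can show.

```python
import hashlib
import math
from math import comb


def keyed_bit(key: bytes, context: tuple, token: int) -> int:
    """Pseudorandom bit from the private key and local context —
    hypothetical stand-in for the key-guided watermark partition."""
    h = hashlib.sha256(key + repr((context, token)).encode()).digest()
    return h[0] & 1


def entropy(probs: list) -> float:
    """Shannon entropy (bits) of a next-token distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def entropy_gated_score(tokens, probs_per_step, key, gate=2.0, ctx=2):
    """Score only positions whose estimated next-token entropy is at
    least `gate` bits. Returns (hits, n): watermark-aligned tokens
    among the n positions actually scored."""
    hits = n = 0
    for i in range(ctx, len(tokens)):
        if entropy(probs_per_step[i]) < gate:
            continue  # low-entropy token: little watermark signal, skip
        n += 1
        hits += keyed_bit(key, tuple(tokens[i - ctx:i]), tokens[i])
    return hits, n


def binom_pvalue(hits: int, n: int, p: float = 0.5) -> float:
    """One-sided binomial tail P(X >= hits) under the no-watermark
    null, where each scored token aligns with the key with prob p."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(hits, n + 1))
```

Detection then reduces to a hypothesis test: if the model never saw the watermarked dataset, scored tokens align with the key at the chance rate, and `binom_pvalue` stays large; radioactivity from fine-tuning pushes the hit rate up and the p-value below the 0.05 threshold reported above.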
🛡️ Threat Analysis
TRACE watermarks TRAINING DATA (not model weights) with a private-key-guided distortion-free scheme to detect misappropriation — exactly the 'watermarking training data to detect if someone trained on my data' case explicitly listed under ML09. Detection exploits the radioactivity effect to observe watermark signals in model text outputs, making this a content provenance and output integrity problem. Per the classification guide, data watermarking for misappropriation detection is ML09, not ML05.