Defense · 2025

Leave No TRACE: Black-box Detection of Copyrighted Dataset Usage in Large Language Models via Watermarking

Jingqi Zhang 1, Ruibo Chen 2, Erhan Xu 3,4, Peihua Mai 1, Heng Huang 2, Yan Pang 1

5 citations · 41 references · arXiv


Published on arXiv: 2510.02962

Output Integrity Attack

OWASP ML Top 10 — ML09

Key Finding

TRACE achieves statistically significant dataset-usage detections (p<0.05) in fully black-box settings across diverse LLM families, remains robust after continued pretraining on non-watermarked corpora, and supports attribution of multiple protected datasets.

TRACE

Novel technique introduced


Large Language Models (LLMs) are increasingly fine-tuned on smaller, domain-specific datasets to improve downstream performance. These datasets often contain proprietary or copyrighted material, raising the need for reliable safeguards against unauthorized use. Existing membership inference attacks (MIAs) and dataset-inference methods typically require access to internal signals such as logits, while current black-box approaches often rely on handcrafted prompts or a clean reference dataset for calibration, both of which limit practical applicability. Watermarking is a promising alternative, but prior techniques can degrade text quality or reduce task performance. We propose TRACE, a practical framework for fully black-box detection of copyrighted dataset usage in LLM fine-tuning. TRACE rewrites datasets with distortion-free watermarks guided by a private key, ensuring both text quality and downstream utility. At detection time, we exploit the radioactivity effect of fine-tuning on watermarked data and introduce an entropy-gated procedure that selectively scores high-uncertainty tokens, substantially amplifying detection power. Across diverse datasets and model families, TRACE consistently achieves significant detections (p<0.05), often with extremely strong statistical evidence. Furthermore, it supports multi-dataset attribution and remains robust even after continued pretraining on large non-watermarked corpora. These results establish TRACE as a practical route to reliable black-box verification of copyrighted dataset usage. We will make our code available at: https://github.com/NusIoraPrivacy/TRACE.
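To make "distortion-free" concrete: one standard construction (exponential-minimum / Gumbel-max sampling, as in Aaronson-style watermarks) replaces the sampler's randomness with a keyed pseudorandom function, so each token is still drawn from the model's exact distribution while the choice becomes verifiable with the key. The sketch below illustrates that general idea; it is not TRACE's actual rewriting pipeline, and all names (`keyed_uniforms`, `gumbel_watermark_sample`) are illustrative.

```python
import hashlib
import math

def keyed_uniforms(key: bytes, context: str, vocab):
    """Pseudorandom U(0,1] draw per vocabulary token, derived from a
    private key and the generation context via SHA-256 (a PRF stand-in,
    not true randomness)."""
    out = []
    for tok in vocab:
        h = hashlib.sha256(key + context.encode() + tok.encode()).digest()
        # +1 keeps the draw strictly positive so log() below is defined.
        out.append((int.from_bytes(h[:8], "big") + 1) / (2**64 + 1))
    return out

def gumbel_watermark_sample(probs, key, context, vocab):
    """Pick argmax_k r_k^(1/p_k). Marginally over the PRF output this
    reproduces sampling from `probs` exactly (hence distortion-free),
    while the private key makes each choice verifiable later."""
    rs = keyed_uniforms(key, context, vocab)
    best, best_score = None, -math.inf
    for tok, p, r in zip(vocab, probs, rs):
        if p <= 0:
            continue
        score = math.log(r) / p  # argmax r^(1/p) == argmax log(r)/p
        if score > best_score:
            best, best_score = tok, score
    return best
```

A detector holding `key` can recompute `rs` for each position and test whether the observed tokens look "too lucky" under the keyed draws; without the key, the output is statistically indistinguishable from ordinary sampling.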


Key Contributions

  • Distortion-free dataset watermarking scheme guided by a private key that preserves text quality and downstream task utility when used as LLM fine-tuning data
  • Entropy-gated detection procedure that selectively scores high-uncertainty tokens to amplify the radioactivity signal of watermarked fine-tuning data, enabling fully black-box detection
  • Support for multi-dataset attribution and robustness to continued pretraining on large non-watermarked corpora, with statistically significant detections (p<0.05) across diverse datasets and model families
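The entropy-gated detection idea can be sketched as a simple statistical test: score only positions where the model is genuinely uncertain (low-entropy tokens carry little watermark signal), check each scored token against a key-derived "green list", and compute a one-sided p-value against the null of no watermark. This is a minimal illustration assuming a Kirchenbauer-style green/red vocabulary split and a normal approximation to the binomial null; TRACE's actual scoring procedure may differ, and all names here are hypothetical.

```python
import hashlib
import math

def in_green_list(prev_token: str, token: str, key: bytes) -> bool:
    # Keyed hash of (previous token, candidate token) splits the
    # vocabulary into "green" (watermark-favored) and "red" halves.
    h = hashlib.sha256(key + prev_token.encode() + token.encode()).digest()
    return h[0] % 2 == 0

def entropy(probs):
    """Shannon entropy (nats) of one position's next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def watermark_pvalue(tokens, probs_per_pos, key, entropy_gate=1.0):
    """One-sided p-value that the suspect model's output is enriched in
    green-list tokens, scoring only high-entropy positions."""
    hits, n = 0, 0
    for i in range(1, len(tokens)):
        if entropy(probs_per_pos[i]) < entropy_gate:
            continue  # gate: skip low-uncertainty tokens
        n += 1
        if in_green_list(tokens[i - 1], tokens[i], key):
            hits += 1
    if n == 0:
        return 1.0
    # Under the null (no watermark), hits ~ Binomial(n, 1/2);
    # normal approximation to the upper tail.
    z = (hits - 0.5 * n) / math.sqrt(0.25 * n)
    return 0.5 * math.erfc(z / math.sqrt(2))
```

In this framing, a detection is declared when the p-value falls below 0.05, and multi-dataset attribution amounts to running the same test once per dataset owner's private key (with a multiple-testing correction such as Bonferroni).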

🛡️ Threat Analysis

Output Integrity Attack

TRACE watermarks TRAINING DATA (not model weights) with a private-key-guided distortion-free scheme to detect misappropriation — exactly the 'watermarking training data to detect if someone trained on my data' case explicitly listed under ML09. Detection exploits the radioactivity effect to observe watermark signals in model text outputs, making this a content provenance and output integrity problem. Per the classification guide, data watermarking for misappropriation detection is ML09, not ML05.


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
black_box, training_time
Applications
llm fine-tuning, copyright protection, dataset misappropriation detection