Defense · 2025

Leave No TRACE: Black-box Detection of Copyrighted Dataset Usage in Large Language Models via Watermarking

Jingqi Zhang 1, Ruibo Chen 2, Erhan Xu 3,4, Peihua Mai 1, Heng Huang 2, Yan Pang 1

5 citations · 41 references · arXiv


Published on arXiv: 2510.02962

Output Integrity Attack

OWASP ML Top 10 — ML09

Key Finding

TRACE achieves statistically significant dataset-usage detections (p<0.05) in fully black-box settings across diverse LLM families, remains robust after continued pretraining on non-watermarked corpora, and supports attribution of multiple protected datasets.

TRACE

Novel technique introduced


Large Language Models (LLMs) are increasingly fine-tuned on smaller, domain-specific datasets to improve downstream performance. These datasets often contain proprietary or copyrighted material, raising the need for reliable safeguards against unauthorized use. Existing membership inference attacks (MIAs) and dataset-inference methods typically require access to internal signals such as logits, while current black-box approaches often rely on handcrafted prompts or a clean reference dataset for calibration, both of which limit practical applicability. Watermarking is a promising alternative, but prior techniques can degrade text quality or reduce task performance. We propose TRACE, a practical framework for fully black-box detection of copyrighted dataset usage in LLM fine-tuning. TRACE rewrites datasets with distortion-free watermarks guided by a private key, ensuring both text quality and downstream utility. At detection time, we exploit the radioactivity effect of fine-tuning on watermarked data and introduce an entropy-gated procedure that selectively scores high-uncertainty tokens, substantially amplifying detection power. Across diverse datasets and model families, TRACE consistently achieves significant detections (p<0.05), often with extremely strong statistical evidence. Furthermore, it supports multi-dataset attribution and remains robust even after continued pretraining on large non-watermarked corpora. These results establish TRACE as a practical route to reliable black-box verification of copyrighted dataset usage. We will make our code available at: https://github.com/NusIoraPrivacy/TRACE.
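To make "distortion-free" concrete: one standard construction (exponential-minimum / Gumbel-max sampling, as in Aaronson-style watermarks) replaces the sampler's randomness with a keyed pseudorandom function, so each token is still drawn from the model's exact distribution while the choice becomes verifiable with the key. The sketch below illustrates that general idea; it is not TRACE's actual rewriting pipeline, and all names (`keyed_uniforms`, `gumbel_watermark_sample`) are illustrative.

```python
import hashlib
import math

def keyed_uniforms(key: bytes, context: str, vocab):
    """Pseudorandom U(0,1] draw per vocabulary token, derived from a
    private key and the generation context via SHA-256 (a PRF stand-in,
    not true randomness)."""
    out = []
    for tok in vocab:
        h = hashlib.sha256(key + context.encode() + tok.encode()).digest()
        # +1 keeps the draw strictly positive so log() below is defined.
        out.append((int.from_bytes(h[:8], "big") + 1) / (2**64 + 1))
    return out

def gumbel_watermark_sample(probs, key, context, vocab):
    """Pick argmax_k r_k^(1/p_k). Marginally over the PRF output this
    reproduces sampling from `probs` exactly (hence distortion-free),
    while the private key makes each choice verifiable later."""
    rs = keyed_uniforms(key, context, vocab)
    best, best_score = None, -math.inf
    for tok, p, r in zip(vocab, probs, rs):
        if p <= 0:
            continue
        score = math.log(r) / p  # argmax r^(1/p) == argmax log(r)/p
        if score > best_score:
            best, best_score = tok, score
    return best
```

A detector holding `key` can recompute `rs` for each position and test whether the observed tokens look "too lucky" under the keyed draws; without the key, the output is statistically indistinguishable from ordinary sampling.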


Key Contributions

  • Distortion-free dataset watermarking scheme guided by a private key that preserves text quality and downstream task utility when used as LLM fine-tuning data
  • Entropy-gated detection procedure that selectively scores high-uncertainty tokens to amplify the radioactivity signal of watermarked fine-tuning data, enabling fully black-box detection
  • Support for multi-dataset attribution and robustness to continued pretraining on large non-watermarked corpora, with statistically significant detections (p<0.05) across diverse datasets and model families
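The entropy-gated detection idea can be sketched as a simple statistical test: score only positions where the model is genuinely uncertain (low-entropy tokens carry little watermark signal), check each scored token against a key-derived "green list", and compute a one-sided p-value against the null of no watermark. This is a minimal illustration assuming a Kirchenbauer-style green/red vocabulary split and a normal approximation to the binomial null; TRACE's actual scoring procedure may differ, and all names here are hypothetical.

```python
import hashlib
import math

def in_green_list(prev_token: str, token: str, key: bytes) -> bool:
    # Keyed hash of (previous token, candidate token) splits the
    # vocabulary into "green" (watermark-favored) and "red" halves.
    h = hashlib.sha256(key + prev_token.encode() + token.encode()).digest()
    return h[0] % 2 == 0

def entropy(probs):
    """Shannon entropy (nats) of one position's next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def watermark_pvalue(tokens, probs_per_pos, key, entropy_gate=1.0):
    """One-sided p-value that the suspect model's output is enriched in
    green-list tokens, scoring only high-entropy positions."""
    hits, n = 0, 0
    for i in range(1, len(tokens)):
        if entropy(probs_per_pos[i]) < entropy_gate:
            continue  # gate: skip low-uncertainty tokens
        n += 1
        if in_green_list(tokens[i - 1], tokens[i], key):
            hits += 1
    if n == 0:
        return 1.0
    # Under the null (no watermark), hits ~ Binomial(n, 1/2);
    # normal approximation to the upper tail.
    z = (hits - 0.5 * n) / math.sqrt(0.25 * n)
    return 0.5 * math.erfc(z / math.sqrt(2))
```

In this framing, a detection is declared when the p-value falls below 0.05, and multi-dataset attribution amounts to running the same test once per dataset owner's private key (with a multiple-testing correction such as Bonferroni).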

🛡️ Threat Analysis

Output Integrity Attack

TRACE watermarks TRAINING DATA (not model weights) with a private-key-guided distortion-free scheme to detect misappropriation — exactly the 'watermarking training data to detect if someone trained on my data' case explicitly listed under ML09. Detection exploits the radioactivity effect to observe watermark signals in model text outputs, making this a content provenance and output integrity problem. Per the classification guide, data watermarking for misappropriation detection is ML09, not ML05.


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
black_box, training_time
Applications
llm fine-tuning, copyright protection, dataset misappropriation detection