defense 2025

A Statistical Hypothesis Testing Framework for Data Misappropriation Detection in Large Language Models

Yinpeng Cai 1, Lexin Li 2, Linjun Zhang 3

3 citations · 27 references · arXiv


Published on arXiv: 2501.02441

Output Integrity Attack

OWASP ML Top 10 — ML09

Key Finding

The proposed tests are asymptotically optimal and give explicit control of type I and type II errors when detecting LLM training-data misappropriation via watermarks.


Large Language Models (LLMs) have gained enormous popularity in recent years. However, the training of LLMs has raised significant privacy and legal concerns, particularly regarding the distillation and inclusion of copyrighted materials in their training data without proper attribution or licensing, an issue that falls under the broader concern of data misappropriation. In this article, we focus on a specific problem of data misappropriation detection: determining whether a given LLM has incorporated data generated by another LLM. We propose embedding watermarks into the copyrighted training data and formulating the detection of data misappropriation as a hypothesis testing problem. We develop a general statistical testing framework, construct test statistics, determine optimal rejection thresholds, and explicitly control the type I and type II errors. Furthermore, we establish the asymptotic optimality of the proposed tests and demonstrate their empirical effectiveness through extensive numerical experiments.
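Abstractly, and using notation of our own rather than the paper's, the detection problem has the following hypothesis-testing shape: let Y_{1:n} be the suspect model's generated tokens and κ the secret watermark key.

```latex
% H_0: the suspect model's tokens are independent of the watermark key
% H_1: they are dependent (the model was trained on watermarked data)
\[
  H_0 : Y_{1:n} \perp \kappa
  \qquad \text{vs.} \qquad
  H_1 : Y_{1:n} \not\perp \kappa .
\]
% Reject H_0 when the test statistic exceeds a threshold chosen so that
\[
  \sup_{P \in H_0} \Pr\bigl( T_n(Y_{1:n}, \kappa) \ge \tau_\alpha \bigr) \le \alpha ,
\]
% which controls the type I error at level alpha; the threshold is then
% tuned so the type II error vanishes at the best possible asymptotic rate.
```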


Key Contributions

  • Statistical hypothesis testing framework that formalizes data misappropriation detection as a test of token-key dependency, with explicit type I and type II error control
  • Optimal rejection threshold derivation via large deviation theory and minimax optimization, with asymptotic optimality guarantees
  • Empirical validation showing the framework can detect whether an LLM has been trained on watermarked outputs from another LLM
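The paper derives its statistics and optimal thresholds via large deviation theory over the token-key dependence; as a much simpler illustrative sketch of the same hypothesis-testing shape (assuming a green-list-style watermark, which is our choice of example and not necessarily the paper's scheme), a one-sided count test with explicit type I error control looks like:

```python
import math
from statistics import NormalDist

def watermark_test(green_count, n, gamma=0.5, alpha=0.01):
    """One-sided z-test for watermark presence.

    H0: no watermark -- each token falls in the secret 'green list'
        independently with probability gamma.
    H1: watermarked  -- the green-token fraction exceeds gamma.

    Returns (z, reject), with the type I error controlled at level alpha.
    """
    # Standardize the green-token count under H0 (normal approximation).
    z = (green_count - gamma * n) / math.sqrt(n * gamma * (1 - gamma))
    # Rejection threshold: upper-alpha quantile of the standard normal.
    tau = NormalDist().inv_cdf(1 - alpha)
    return z, z > tau
```

For example, 620 green tokens out of 1,000 at γ = 0.5 gives z ≈ 7.6, far above the threshold τ ≈ 2.33 at α = 0.01, so H0 is rejected; 505 out of 1,000 gives z ≈ 0.32 and no rejection.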

🛡️ Threat Analysis

Output Integrity Attack

The paper watermarks LLM text outputs (training-data content), then detects whether another LLM has incorporated that watermarked content into its training corpus. Per the taxonomy, watermarking training data to detect misappropriation ("did someone train on my data?") maps to ML09, output integrity and content provenance. The watermark lives in the generated content/data, not in model weights, so this is not ML05.


Details

Domains
nlp
Model Types
llm
Threat Tags
training_time, black_box
Applications
llm copyright protection, training data misappropriation detection, knowledge distillation detection