defense 2025

A Statistical Hypothesis Testing Framework for Data Misappropriation Detection in Large Language Models

Yinpeng Cai 1, Lexin Li 2, Linjun Zhang 3

3 citations · 27 references · arXiv


Published on arXiv: 2501.02441

Output Integrity Attack

OWASP ML Top 10 — ML09

Key Finding

The proposed tests are asymptotically optimal and give explicit control of type I and type II errors when detecting LLM training-data misappropriation via watermarks.


Large Language Models (LLMs) have gained enormous popularity in recent years. However, the training of LLMs has raised significant privacy and legal concerns, particularly regarding the distillation and inclusion of copyrighted materials in their training data without proper attribution or licensing, an issue that falls under the broader concern of data misappropriation. In this article, we focus on a specific problem of data misappropriation detection: determining whether a given LLM has incorporated data generated by another LLM. We propose embedding watermarks into the copyrighted training data and formulating the detection of data misappropriation as a hypothesis testing problem. We develop a general statistical testing framework, construct test statistics, determine optimal rejection thresholds, and explicitly control the type I and type II errors. Furthermore, we establish the asymptotic optimality of the proposed tests and demonstrate their empirical effectiveness through extensive numerical experiments.
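Abstractly, and using notation of our own rather than the paper's, the detection problem has the following hypothesis-testing shape: let Y_{1:n} be the suspect model's generated tokens and κ the secret watermark key.

```latex
% H_0: the suspect model's tokens are independent of the watermark key
% H_1: they are dependent (the model was trained on watermarked data)
\[
  H_0 : Y_{1:n} \perp \kappa
  \qquad \text{vs.} \qquad
  H_1 : Y_{1:n} \not\perp \kappa .
\]
% Reject H_0 when the test statistic exceeds a threshold chosen so that
\[
  \sup_{P \in H_0} \Pr\bigl( T_n(Y_{1:n}, \kappa) \ge \tau_\alpha \bigr) \le \alpha ,
\]
% which controls the type I error at level alpha; the threshold is then
% tuned so the type II error vanishes at the best possible asymptotic rate.
```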


Key Contributions

  • Statistical hypothesis testing framework that formalizes data misappropriation detection as a test of token-key dependency, with explicit type I and type II error control
  • Optimal rejection threshold derivation via large deviation theory and minimax optimization, with asymptotic optimality guarantees
  • Empirical validation showing the framework can detect whether an LLM has been trained on watermarked outputs from another LLM
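The paper derives its statistics and optimal thresholds via large deviation theory over the token-key dependence; as a much simpler illustrative sketch of the same hypothesis-testing shape (assuming a green-list-style watermark, which is our choice of example and not necessarily the paper's scheme), a one-sided count test with explicit type I error control looks like:

```python
import math
from statistics import NormalDist

def watermark_test(green_count, n, gamma=0.5, alpha=0.01):
    """One-sided z-test for watermark presence.

    H0: no watermark -- each token falls in the secret 'green list'
        independently with probability gamma.
    H1: watermarked  -- the green-token fraction exceeds gamma.

    Returns (z, reject), with the type I error controlled at level alpha.
    """
    # Standardize the green-token count under H0 (normal approximation).
    z = (green_count - gamma * n) / math.sqrt(n * gamma * (1 - gamma))
    # Rejection threshold: upper-alpha quantile of the standard normal.
    tau = NormalDist().inv_cdf(1 - alpha)
    return z, z > tau
```

For example, 620 green tokens out of 1,000 at γ = 0.5 gives z ≈ 7.6, far above the threshold τ ≈ 2.33 at α = 0.01, so H0 is rejected; 505 out of 1,000 gives z ≈ 0.32 and no rejection.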

🛡️ Threat Analysis

Output Integrity Attack

The paper watermarks LLM text outputs (training-data content), then detects whether another LLM has incorporated that watermarked content into its training corpus. Per the taxonomy, watermarking training data to detect misappropriation ("did someone train on my data?") maps to ML09, output integrity and content provenance. The watermark lives in the generated content/data, not in model weights, so this is not ML05.


Details

Domains
nlp
Model Types
llm
Threat Tags
training_time, black_box
Applications
llm copyright protection, training data misappropriation detection, knowledge distillation detection