Provable Training Data Identification for Large Language Models

Zhenlong Liu 1,2, Hao Zeng 1, Weiran Huang 2,3, Hongxin Wei 1

0 citations · 59 references · arXiv

Published on arXiv · 2510.09717

Membership Inference Attack

OWASP ML Top 10 — ML04

Key Finding

PTDI achieves a strictly controlled false identification rate (FIR) while attaining higher statistical power than prior membership inference methods across diverse LLM and VLM settings.

PTDI (Provable Training Data Identification)

Novel technique introduced


Identifying the training data of large-scale models is critical for copyright litigation, privacy auditing, and ensuring fair evaluation. However, existing works typically treat this task as instance-wise identification without controlling the error rate of the identified set, which cannot provide statistically reliable evidence. In this work, we formalize training data identification as a set-level inference problem and propose Provable Training Data Identification (PTDI), a distribution-free approach that enables provable and strict false identification rate (FIR) control. Specifically, our method computes conformal p-values for each data point using a set of known unseen data and then develops a novel Jackknife-corrected Beta boundary (JKBB) estimator to estimate the training-data proportion of the test set, which allows us to scale these p-values. By applying the Benjamini-Hochberg (BH) procedure to the scaled p-values, we select a subset of data points with provable and strict false identification control. Extensive experiments across various models and datasets demonstrate that PTDI achieves higher power than prior methods while strictly controlling the FIR.
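The first step of the pipeline, computing a conformal p-value per test point against scores from known unseen (non-member) data, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the detection score function and its direction (here, higher score = more member-like) are assumptions.

```python
import numpy as np

def conformal_p_values(test_scores, calib_scores):
    """Conformal p-value for each test point, using calibration scores
    computed on data known to be *unseen* by the model.

    Assumes higher score = more member-like, so under the null
    hypothesis "this point is unseen", a small p-value is evidence
    of membership.
    """
    calib = np.asarray(calib_scores, dtype=float)
    n = calib.size
    # Smoothed rank: fraction of calibration scores at least as extreme,
    # with +1 in numerator and denominator for exact validity.
    return np.array(
        [(1 + np.sum(calib >= s)) / (n + 1) for s in np.asarray(test_scores)]
    )
```

For a genuinely unseen test point, this p-value is (super-)uniform on [0, 1]; member points concentrate near 0, which is what the downstream BH step exploits.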


Key Contributions

  • Formalizes training data identification as a set-level inference problem with provable false identification rate (FIR) control, unlike prior instance-wise approaches
  • Proposes the Jackknife-corrected Beta boundary (JKBB) estimator to estimate training-data proportion in the test set, enabling data-dependent p-value scaling
  • Applies the Benjamini-Hochberg procedure to conformal p-values for statistically rigorous, high-power training data identification compatible with both black-box and white-box detection scores
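The selection step described above can be sketched as a standard Benjamini-Hochberg procedure over scaled p-values. In this sketch the `pi_unseen` argument is a hypothetical stand-in for the fraction of non-training data that the paper's JKBB estimator would supply; scaling p-values by a proportion below 1 makes the procedure less conservative, mimicking the adaptive step.

```python
import numpy as np

def bh_select(p_values, alpha=0.05, pi_unseen=1.0):
    """Benjamini-Hochberg selection on (optionally scaled) conformal p-values.

    `pi_unseen`: assumed proportion of non-training (null) points in the
    test set; in PTDI this would come from the JKBB estimator. Returns the
    indices of points identified as training data, with the target error
    rate `alpha` on the identified set.
    """
    p = np.asarray(p_values, dtype=float) * pi_unseen  # scaled p-values
    m = p.size
    order = np.argsort(p)
    sorted_p = p[order]
    # BH step-up thresholds: alpha * k / m for the k-th smallest p-value.
    thresholds = alpha * np.arange(1, m + 1) / m
    passing = np.nonzero(sorted_p <= thresholds)[0]
    if passing.size == 0:
        return np.array([], dtype=int)
    k = passing.max()  # largest k whose p-value clears its threshold
    return np.sort(order[: k + 1])
```

A usage example: `bh_select([0.001, 0.002, 0.5, 0.9], alpha=0.05)` flags the first two points. Because the method is score-agnostic, the same selection logic applies to both black-box and white-box detection scores.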

🛡️ Threat Analysis

Membership Inference Attack

The paper's core contribution is determining whether specific data points were in an LLM's training set — the definition of membership inference. It formalizes and improves MIA methodology by shifting from instance-wise binary classification to set-level inference with statistical guarantees (FIR control), directly applicable to privacy auditing and copyright litigation.


Details

Domains
nlp
Model Types
llm · transformer · vlm
Threat Tags
black_box · white_box · inference_time
Applications
llm training data auditing · copyright litigation · privacy auditing · benchmark contamination detection