Gap-K%: Measuring Top-1 Prediction Gap for Detecting Pretraining Data
Published on arXiv: 2601.19936
Membership Inference Attack
OWASP ML Top 10 — ML04
Key Finding
Gap-K% consistently outperforms prior token-likelihood-based membership inference baselines across various LLM sizes and input lengths on WikiMIA and MIMIR benchmarks.
Gap-K% (novel technique introduced)
The opacity of massive pretraining corpora in Large Language Models (LLMs) raises significant privacy and copyright concerns, making pretraining data detection a critical challenge. Existing state-of-the-art methods typically rely on token likelihoods, yet they often overlook both the divergence from the model's top-1 prediction and the local correlations between adjacent tokens. In this work, we propose Gap-K%, a novel pretraining data detection method grounded in the optimization dynamics of LLM pretraining. By analyzing the next-token prediction objective, we observe that discrepancies between the model's top-1 prediction and the target token induce strong gradient signals, which are explicitly penalized during training. Motivated by this, Gap-K% leverages the log probability gap between the top-1 predicted token and the target token, incorporating a sliding window strategy to capture local correlations and mitigate token-level fluctuations. Extensive experiments on the WikiMIA and MIMIR benchmarks demonstrate that Gap-K% achieves state-of-the-art performance, consistently outperforming prior baselines across various model sizes and input lengths.
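The abstract's mechanism can be sketched as follows. This is a hypothetical reconstruction, not the paper's reference implementation: the function name, the use of sliding-window means, and the assumption that the K% of windows with the largest gaps are aggregated (analogous to how Min-K% Prob aggregates the lowest token log-probs) are all my own illustrative choices.

```python
import numpy as np

def gap_k_percent_score(log_probs, target_ids, window=5, k=20.0):
    """Hypothetical sketch of a Gap-K%-style membership score.

    log_probs:  (T, V) array of per-position next-token log probabilities
                (e.g. log-softmaxed logits from a causal LM).
    target_ids: (T,) array of the tokens that actually appeared.
    Returns a scalar; lower values suggest the text was seen in pretraining.
    """
    T = len(target_ids)
    # Per-token gap: top-1 log prob minus target log prob (always >= 0).
    top1_lp = log_probs.max(axis=1)
    target_lp = log_probs[np.arange(T), target_ids]
    gaps = top1_lp - target_lp
    # Sliding-window means capture local correlations and smooth
    # token-level fluctuations, per the abstract.
    window = min(window, T)
    win_scores = np.convolve(gaps, np.ones(window) / window, mode="valid")
    # Assumption: aggregate the K% of windows with the largest gaps.
    # Member text should track the top-1 prediction closely (small gaps),
    # since those discrepancies were penalized during training.
    k_count = max(1, int(len(win_scores) * k / 100.0))
    worst = np.sort(win_scores)[-k_count:]
    return float(worst.mean())
```

A membership decision would then threshold this score (score below a calibrated threshold implies "member"), with the threshold swept to produce the AUC-style results reported on WikiMIA and MIMIR.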
Key Contributions
- Introduces Gap-K%, a pretraining data detection method leveraging the log probability gap between the model's top-1 predicted token and the actual target token, motivated by gradient dynamics of the next-token prediction objective
- Incorporates a sliding window strategy to capture local token correlations and smooth token-level score fluctuations
- Achieves state-of-the-art membership inference performance on WikiMIA and MIMIR benchmarks across multiple model sizes and input lengths
🛡️ Threat Analysis
Gap-K% is a membership inference attack (MIA): it answers the binary question "was this specific text in the LLM's training set?" by exploiting the gap between the model's top-1 prediction and the target token, smoothed over local windows of adjacent tokens. It is evaluated on WikiMIA and MIMIR, the canonical MIA benchmarks.