Learning to Detect Language Model Training Data via Active Reconstruction
Junjie Oscar Yin 1, John X. Morris 2, Vitaly Shmatikov 2, Sewon Min 3,4, Hannaneh Hajishirzi 1,4
Published on arXiv
2602.19020
Membership Inference Attack
OWASP ML Top 10 — ML04
Sensitive Information Disclosure
OWASP LLM Top 10 — LLM06
Key Finding
ADRA+ outperforms Min-K%++ by 18.8% on BookMIA (pre-training detection) and 7.6% on AIME (post-training detection), with an average 10.7% improvement over the prior runner-up across all settings.
ADRA (Active Data Reconstruction Attack)
Novel technique introduced
Detecting LLM training data is generally framed as a membership inference attack (MIA) problem. However, conventional MIAs operate passively on fixed model weights, using log-likelihoods or text generations. In this work, we introduce the Active Data Reconstruction Attack (ADRA), a family of MIAs that actively induces a model to reconstruct a given text through training. We hypothesize that training data are more reconstructible than non-members, and that this difference in reconstructibility can be exploited for membership inference. Motivated by findings that reinforcement learning (RL) sharpens behaviors already encoded in the weights, we leverage on-policy RL to actively elicit data reconstruction by finetuning a policy initialized from the target model. To make RL effective for MIA, we design reconstruction metrics and contrastive rewards. The resulting algorithms, ADRA and its adaptive variant ADRA+, improve both reconstruction and detection given a pool of candidate data. Experiments show that our methods consistently outperform existing MIAs in detecting pre-training, post-training, and distillation data, with an average improvement of 10.7% over the previous runner-up. In particular, ADRA+ improves over Min-K%++ by 18.8% on BookMIA for pre-training detection and by 7.6% on AIME for post-training detection.
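The core decision rule described above — flag a candidate as a member if RL finetuning makes the model noticeably better at reconstructing it — can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the similarity metric (here, `difflib` sequence matching) and the `threshold` parameter are our stand-ins for the paper's reconstruction metrics and calibration.

```python
from difflib import SequenceMatcher

def reconstruction_score(generated: str, candidate: str) -> float:
    """Similarity between a model generation and the candidate text.
    A stand-in for the paper's reconstruction metrics (our assumption)."""
    return SequenceMatcher(None, generated, candidate).ratio()

def adra_membership(gen_before: str, gen_after: str, candidate: str,
                    threshold: float = 0.2) -> bool:
    """Flag `candidate` as a training member if RL finetuning raised its
    reconstructibility by more than `threshold` (hypothetical decision rule)."""
    delta = (reconstruction_score(gen_after, candidate)
             - reconstruction_score(gen_before, candidate))
    return delta > threshold

# Toy illustration: after RL, a member is reconstructed almost verbatim,
# so the reconstructibility gain crosses the threshold.
member = "It was the best of times, it was the worst of times."
print(adra_membership("It was the best of",  # pre-RL generation (partial)
                      member,                # post-RL generation (verbatim)
                      member))               # candidate text → True
```

In the actual attack, the two generations would come from the target model before and after on-policy RL finetuning on the candidate pool; the contrastive rewards the paper designs shape that finetuning so the gap between members and non-members widens.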
Key Contributions
- Active Data Reconstruction Attack (ADRA/ADRA+): a novel MIA framework that fine-tunes a policy initialized from the target LLM using on-policy RL to elicit training data reconstruction as a membership signal
- Design of reconstruction-based reward metrics and contrastive RL objectives that make RL effective for membership inference
- Demonstrated average 10.7% improvement over prior MIAs across pre-training, post-training, and distillation detection settings, with +18.8% on BookMIA and +7.6% on AIME over Min-K%++
🛡️ Threat Analysis
ADRA is explicitly framed as a membership inference attack (MIA) — its primary goal is determining whether specific text was in an LLM's training set. The RL-based reconstruction is the novel signal used to make this binary determination. Results are reported as detection accuracy improvements over prior MIAs (Min-K%++, etc.).