Learning to Detect Language Model Training Data via Active Reconstruction

Junjie Oscar Yin 1, John X. Morris 2, Vitaly Shmatikov 2, Sewon Min 3,4, Hannaneh Hajishirzi 1,4

0 citations · 49 references · arXiv (Cornell University)

Published on arXiv

2602.19020

Membership Inference Attack

OWASP ML Top 10 — ML04

Sensitive Information Disclosure

OWASP LLM Top 10 — LLM06

Key Finding

ADRA+ outperforms Min-K%++ by 18.8% on BookMIA (pre-training detection) and 7.6% on AIME (post-training detection), with an average 10.7% improvement over the prior runner-up across all settings.

ADRA (Active Data Reconstruction Attack)

Novel technique introduced


Detecting LLM training data is generally framed as a membership inference attack (MIA) problem. However, conventional MIAs operate passively on fixed model weights, using log-likelihoods or text generations. In this work, we introduce the Active Data Reconstruction Attack (ADRA), a family of MIAs that actively induces a model to reconstruct a given text through training. We hypothesize that training data are more reconstructible than non-members, and that this difference in reconstructibility can be exploited for membership inference. Motivated by findings that reinforcement learning (RL) sharpens behaviors already encoded in the weights, we leverage on-policy RL to actively elicit data reconstruction by finetuning a policy initialized from the target model. To make RL effective for MIA, we design reconstruction metrics and contrastive rewards. The resulting algorithms, ADRA and its adaptive variant ADRA+, improve both reconstruction and detection given a pool of candidate data. Experiments show that our methods consistently outperform existing MIAs in detecting pre-training, post-training, and distillation data, with an average improvement of 10.7% over the previous runner-up. In particular, ADRA+ improves over Min-K%++ by 18.8% on BookMIA for pre-training detection and by 7.6% on AIME for post-training detection.


Key Contributions

  • Active Data Reconstruction Attack (ADRA/ADRA+): a novel MIA framework that fine-tunes a policy initialized from the target LLM using on-policy RL to elicit training data reconstruction as a membership signal
  • Design of reconstruction-based reward metrics and contrastive RL objectives that make RL effective for membership inference
  • Demonstrated average 10.7% improvement over prior MIAs across pre-training, post-training, and distillation detection settings, with +18.8% on BookMIA and +7.6% on AIME over Min-K%++
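The core idea above — members are more reconstructible under RL finetuning than non-members — can be sketched with a toy scoring function. This is a hypothetical illustration, not the paper's implementation: the names `reconstruction_score` and `adra_membership_score` are assumptions, the true metric is a word-overlap stand-in for whatever reconstruction metric the paper uses, and the model generations are taken as given strings rather than produced by an actual policy.

```python
from difflib import SequenceMatcher

def reconstruction_score(reference: str, generated: str) -> float:
    # Fraction of the reference text recovered by the generation,
    # measured by word-level matching-subsequence ratio (a stand-in
    # for the paper's reconstruction metrics).
    return SequenceMatcher(None, reference.split(), generated.split()).ratio()

def adra_membership_score(reference: str,
                          gen_before_rl: str,
                          gen_after_rl: str) -> float:
    # Membership signal: how much on-policy RL finetuning improved
    # reconstruction of the candidate text. Under the paper's hypothesis,
    # members show a larger gain than non-members.
    return (reconstruction_score(reference, gen_after_rl)
            - reconstruction_score(reference, gen_before_rl))
```

In this sketch, a candidate whose reconstruction improves sharply after RL finetuning would be scored as a likely member; thresholding or ranking these scores over a pool of candidates yields the binary membership decision.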

🛡️ Threat Analysis

Membership Inference Attack

ADRA is explicitly framed as a membership inference attack (MIA) — its primary goal is determining whether specific text was in an LLM's training set. The RL-based reconstruction is the novel signal used to make this binary determination. Results are reported as detection accuracy improvements over prior MIAs (Min-K%++, etc.).
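For contrast with ADRA's active approach, the passive baselines it is compared against score membership directly from token likelihoods. A minimal sketch of the simpler Min-K% predecessor of Min-K%++ is below, assuming per-token log-probabilities have already been extracted from the target model; the function name and the default k are illustrative choices, not taken from the paper.

```python
def min_k_percent_score(token_logprobs: list[float], k: float = 0.2) -> float:
    # Min-K% membership score: average log-probability of the k fraction
    # of tokens the model found least likely. Member texts tend to contain
    # fewer surprisingly low-probability tokens, so higher (less negative)
    # scores suggest membership.
    n = max(1, int(len(token_logprobs) * k))
    lowest = sorted(token_logprobs)[:n]
    return sum(lowest) / n
```

A passive attack ranks candidates by this score alone; ADRA instead perturbs the model through RL training and measures the change in reconstructibility, which is why it is categorized here as inference on training-set membership rather than simple likelihood thresholding.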


Details

Domains
nlp
Model Types
llm, transformer, rl
Threat Tags
white_box, inference_time
Datasets
BookMIA, AIME
Applications
llm training data detection, membership inference on language models