Learning to Detect Language Model Training Data via Active Reconstruction
Junjie Oscar Yin 1, John X. Morris 2, Vitaly Shmatikov 2, Sewon Min 3,4, Hannaneh Hajishirzi 1,4
Published on arXiv
2602.19020
Membership Inference Attack
OWASP ML Top 10 — ML04
Sensitive Information Disclosure
OWASP LLM Top 10 — LLM06
Key Finding
ADRA+ outperforms Min-K%++ by 18.8% on BookMIA (pre-training detection) and 7.6% on AIME (post-training detection), with an average 10.7% improvement over the prior runner-up across all settings.
ADRA (Active Data Reconstruction Attack)
Novel technique introduced
Detecting LLM training data is generally framed as a membership inference attack (MIA) problem. However, conventional MIAs operate passively on fixed model weights, using log-likelihoods or text generations. In this work, we introduce the Active Data Reconstruction Attack (ADRA), a family of MIAs that actively induces a model to reconstruct a given text through training. We hypothesize that training data are more reconstructible than non-members, and that this difference in reconstructibility can be exploited for membership inference. Motivated by findings that reinforcement learning (RL) sharpens behaviors already encoded in the weights, we leverage on-policy RL to actively elicit data reconstruction by finetuning a policy initialized from the target model. To make RL effective for MIA, we design reconstruction metrics and contrastive rewards. The resulting algorithms, ADRA and its adaptive variant ADRA+, improve both reconstruction and detection given a pool of candidate data. Experiments show that our methods consistently outperform existing MIAs in detecting pre-training, post-training, and distillation data, with an average improvement of 10.7% over the previous runner-up. In particular, ADRA+ improves over Min-K%++ by 18.8% on BookMIA for pre-training detection and by 7.6% on AIME for post-training detection.
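The core decision rule described above — flag a candidate as a member if RL finetuning makes the model noticeably better at reconstructing it — can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the similarity metric (here, `difflib` sequence matching) and the `threshold` parameter are our stand-ins for the paper's reconstruction metrics and calibration.

```python
from difflib import SequenceMatcher

def reconstruction_score(generated: str, candidate: str) -> float:
    """Similarity between a model generation and the candidate text.
    A stand-in for the paper's reconstruction metrics (our assumption)."""
    return SequenceMatcher(None, generated, candidate).ratio()

def adra_membership(gen_before: str, gen_after: str, candidate: str,
                    threshold: float = 0.2) -> bool:
    """Flag `candidate` as a training member if RL finetuning raised its
    reconstructibility by more than `threshold` (hypothetical decision rule)."""
    delta = (reconstruction_score(gen_after, candidate)
             - reconstruction_score(gen_before, candidate))
    return delta > threshold

# Toy illustration: after RL, a member is reconstructed almost verbatim,
# so the reconstructibility gain crosses the threshold.
member = "It was the best of times, it was the worst of times."
print(adra_membership("It was the best of",  # pre-RL generation (partial)
                      member,                # post-RL generation (verbatim)
                      member))               # candidate text → True
```

In the actual attack, the two generations would come from the target model before and after on-policy RL finetuning on the candidate pool; the contrastive rewards the paper designs shape that finetuning so the gap between members and non-members widens.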
Key Contributions
- Active Data Reconstruction Attack (ADRA/ADRA+): a novel MIA framework that fine-tunes a policy initialized from the target LLM using on-policy RL to elicit training data reconstruction as a membership signal
- Design of reconstruction-based reward metrics and contrastive RL objectives that make RL effective for membership inference
- Demonstrated average 10.7% improvement over prior MIAs across pre-training, post-training, and distillation detection settings, with +18.8% on BookMIA and +7.6% on AIME over Min-K%++
🛡️ Threat Analysis
ADRA is explicitly framed as a membership inference attack (MIA) — its primary goal is determining whether specific text was in an LLM's training set. The RL-based reconstruction is the novel signal used to make this binary determination. Results are reported as detection accuracy improvements over prior MIAs (Min-K%++, etc.).