From Unfamiliar to Familiar: Detecting Pre-training Data via Gradient Deviations in Large Language Models
Ruiqi Zhang, Lingxiang Wang, Hainan Zhang, Zhiming Zheng, Yanyan Lan
Published on arXiv
2603.04828
Membership Inference Attack
OWASP ML Top 10 — ML04
Key Finding
GDS achieves state-of-the-art pre-training data detection across five public datasets with significantly improved cross-dataset transferability over strong baselines including Min-K% and FSD
GDS (Gradient Deviation Scores)
Novel technique introduced
Pre-training data detection for LLMs is essential for addressing copyright concerns and mitigating benchmark contamination. Existing methods mainly rely on likelihood-based statistical features or on heuristic signals measured before and after fine-tuning, but the former are susceptible to word-frequency bias in corpora, and the latter depend strongly on the similarity of the fine-tuning data. From an optimization perspective, we observe that during training, samples transition from unfamiliar to familiar, a shift reflected in systematic differences in gradient behavior: familiar samples exhibit smaller update magnitudes, distinct update locations in model components, and more sharply activated neurons. Based on this insight, we propose GDS, a method that identifies pre-training data by probing the Gradient Deviation Scores of target samples. Specifically, we first represent each sample with gradient profiles that capture the magnitude, location, and concentration of parameter updates across the FFN and Attention modules, revealing consistent distinctions between member and non-member data. These features are then fed into a lightweight classifier that performs binary membership inference. Experiments on five public datasets show that GDS achieves state-of-the-art performance with significantly improved cross-dataset transferability over strong baselines. Further interpretability analyses reveal differences in gradient feature distributions, enabling practical and scalable pre-training data detection.
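To make the "gradient profile" idea concrete, here is a minimal sketch of how per-module gradients could be summarized into magnitude, location, and concentration features. The function name `gradient_profile`, the `top_frac` parameter, and the exact feature definitions are assumptions for illustration; the paper's actual profile construction may differ.

```python
import numpy as np

def gradient_profile(grads_by_module, top_frac=0.01):
    """Summarize per-module gradient arrays (e.g. FFN and Attention blocks)
    into a feature vector: overall magnitude, per-module location shares,
    and per-module concentration. Hypothetical feature set for illustration."""
    names = sorted(grads_by_module)
    norms = np.array([np.linalg.norm(grads_by_module[n]) for n in names])
    total = norms.sum() + 1e-12
    # Magnitude: overall update size; member samples are expected to be smaller.
    magnitude = float(norms.sum())
    # Location: how the update mass is distributed across modules.
    location = norms / total
    # Concentration: fraction of squared-gradient mass in the largest entries,
    # a rough proxy for "sharply activated" neurons.
    concentration = []
    for n in names:
        g2 = np.sort(np.abs(grads_by_module[n]).ravel())[::-1] ** 2
        k = max(1, int(top_frac * g2.size))
        concentration.append(g2[:k].sum() / (g2.sum() + 1e-12))
    return np.concatenate([[magnitude], location, concentration])
```

In a real pipeline the inputs would come from a single backward pass on the target sample, with one array per FFN/Attention parameter group; here any dict of arrays works.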
Key Contributions
- GDS framework that represents each sample via gradient profiles (magnitude, location, concentration) across FFN and Attention modules to distinguish member from non-member data
- Observation that familiar (member) samples exhibit smaller gradient update magnitudes, distinct update locations, and more sharply activated neurons compared to non-members
- Lightweight MLP classifier over gradient features achieving SOTA membership inference performance with improved cross-dataset transferability over baselines like FSD
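The "lightweight MLP classifier" step can be sketched end to end on synthetic stand-in features. The two-feature setup (magnitude, concentration), the hidden width, and the training loop are all assumptions for illustration, not the paper's configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for gradient-profile features: members (label 1) have
# smaller magnitudes and higher concentration than non-members (label 0).
n = 200
members = rng.normal(loc=[0.5, 0.8], scale=0.1, size=(n, 2))
nonmembers = rng.normal(loc=[1.5, 0.3], scale=0.1, size=(n, 2))
X = np.vstack([members, nonmembers])
y = np.array([1] * n + [0] * n)

# One-hidden-layer MLP trained with full-batch gradient descent.
H, lr = 8, 0.5
W1 = rng.normal(scale=0.5, size=(2, H)); b1 = np.zeros(H)
W2 = rng.normal(scale=0.5, size=(H, 1)); b2 = np.zeros(1)

def forward(X):
    h = np.tanh(X @ W1 + b1)
    p = 1 / (1 + np.exp(-(h @ W2 + b2)))
    return h, p.ravel()

for _ in range(500):
    h, p = forward(X)
    # Backpropagation of the binary cross-entropy loss.
    dlogit = (p - y)[:, None] / len(X)
    gW2 = h.T @ dlogit; gb2 = dlogit.sum(0)
    dh = dlogit @ W2.T * (1 - h ** 2)
    gW1 = X.T @ dh; gb1 = dh.sum(0)
    W2 -= lr * gW2; b2 -= lr * gb2
    W1 -= lr * gW1; b1 -= lr * gb1

_, p = forward(X)
accuracy = ((p > 0.5) == y).mean()
```

Because the classifier operates only on a short feature vector rather than on model logits or text, training and inference stay cheap, which is what makes this stage "lightweight" relative to the gradient extraction itself.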
🛡️ Threat Analysis
Proposes a novel binary membership inference method (GDS) that determines whether a specific sample was in an LLM's pre-training data by leveraging gradient profile features — the canonical ML04 problem.