G-Drift MIA: Membership Inference via Gradient-Induced Feature Drift in LLMs
Ravi Ranjan¹, Utkarsh Grover², Xiaomin Lin², Agoritsa Polyzou¹
Published on arXiv
2604.00419
Membership Inference Attack
OWASP ML Top 10 — ML04
Key Finding
G-Drift MIA substantially outperforms existing confidence-, perplexity-, and reference-based attacks on realistic MIA benchmarks where prior methods degrade to near-random performance.
G-Drift MIA
Novel technique introduced
Large language models (LLMs) are trained on massive web-scale corpora, raising growing concerns about privacy and copyright. Membership inference attacks (MIAs) aim to determine whether a given example was used during training. Existing LLM MIAs largely rely on output probabilities or loss values and often perform only marginally better than random guessing when members and non-members are drawn from the same distribution. We introduce G-Drift MIA, a white-box membership inference method based on gradient-induced feature drift. Given a candidate (x, y), we apply a single targeted gradient-ascent step that increases its loss and measure the resulting changes in internal representations — logits, hidden-layer activations, and projections onto fixed feature directions — before and after the update. These drift signals are used to train a lightweight logistic classifier that effectively separates members from non-members. Across multiple transformer-based LLMs and datasets derived from realistic MIA benchmarks, G-Drift substantially outperforms confidence-based, perplexity-based, and reference-based attacks. We further show that memorized training samples systematically exhibit smaller and more structured feature drift than non-members, providing a mechanistic link between gradient geometry, representation stability, and memorization. Overall, our results demonstrate that small, controlled gradient interventions offer a practical tool for auditing training-data membership and assessing privacy risks in LLMs.
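The drift-extraction procedure described in the abstract can be sketched on a toy model. This is a minimal illustration, not the authors' implementation: the tiny two-layer numpy "model", the learning rate, and the random feature directions are all hypothetical stand-ins for an LLM's parameters and probes.

```python
import numpy as np

def drift_features(x, y, V, W, lr=0.05, n_dirs=4, seed=0):
    """Sketch of G-Drift feature extraction on a toy model (hypothetical shapes).

    One gradient-ASCENT step increases the loss on candidate (x, y); the
    drift signal is the change in logits, hidden activations, and
    projections onto fixed feature directions before vs. after the step.
    """
    rng = np.random.default_rng(seed)
    dirs = rng.standard_normal((n_dirs, W.shape[0]))  # fixed feature directions

    def forward(V_, W_):
        h = np.tanh(V_ @ x)                 # hidden-layer activations
        z = W_ @ h                          # logits
        p = np.exp(z - z.max())
        return h, z, p / p.sum()

    h0, z0, p0 = forward(V, W)

    # Cross-entropy gradient w.r.t. logits is (p - onehot(y)); backprop by hand.
    g_z = p0.copy(); g_z[y] -= 1.0
    g_W = np.outer(g_z, h0)
    g_V = np.outer((W.T @ g_z) * (1.0 - h0 ** 2), x)

    # Gradient ASCENT: move parameters in the direction that increases the loss.
    h1, z1, _ = forward(V + lr * g_V, W + lr * g_W)

    return np.concatenate([
        z1 - z0,            # logit drift
        h1 - h0,            # hidden-activation drift
        dirs @ (z1 - z0),   # drift along the fixed directions
    ])
```

For a real LLM the forward pass and gradient step would come from the framework's autograd, but the measured quantities — representation deltas under a single loss-increasing update — are the same.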
Key Contributions
- Introduces gradient-induced feature drift as a membership signal: applies a single gradient-ascent step and measures representation changes in logits, hidden activations, and feature projections
- Demonstrates that memorized training samples exhibit smaller and more structured feature drift than non-members, linking gradient geometry to memorization
- Substantially outperforms confidence-based, perplexity-based, and reference-based MIAs (including LiRA) across multiple transformer LLMs and realistic benchmarks
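The lightweight attack classifier on top of the drift features can be sketched with plain logistic regression. This is a hedged illustration only — `fit_logistic` and the synthetic drift data are hypothetical, and the paper does not specify the classifier's training details beyond it being a lightweight logistic model.

```python
import numpy as np

def fit_logistic(F, labels, lr=0.1, steps=500):
    """Minimal logistic attack classifier over drift features (sketch).

    F: (n, d) matrix of drift-feature vectors; labels: 1 = member, 0 = non-member.
    """
    w, b = np.zeros(F.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(F @ w + b)))  # predicted membership probability
        g = p - labels                           # gradient of the logistic loss
        w -= lr * (F.T @ g) / len(labels)
        b -= lr * g.mean()
    return w, b

def membership_score(f, w, b):
    """Score a single candidate's drift-feature vector; higher = more member-like."""
    return 1.0 / (1.0 + np.exp(-(f @ w + b)))
```

The synthetic setup below mirrors the paper's qualitative finding — members exhibit smaller drift magnitudes than non-members — so a linear classifier on absolute drift values separates the two groups.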
🛡️ Threat Analysis
The core contribution is a novel membership inference attack (G-Drift MIA) that determines whether specific examples were in the LLM's training set by measuring gradient-induced representation changes. This directly addresses the ML04 threat: determining whether a data point was used during training.