AttenMIA: LLM Membership Inference Attack through Attention Signals
Pedram Zaree 1, Md Abdullah Al Mamun 1, Yue Dong 1, Ihsen Alouani 2, Nael Abu-Ghazaleh 1
Published on arXiv
2601.18110
Membership Inference Attack
OWASP ML Top 10 — ML04
Sensitive Information Disclosure
OWASP LLM Top 10 — LLM06
Key Finding
AttenMIA achieves 87.9% TPR@1%FPR and 0.996 ROC AUC on WikiMIA-32 with LLaMA-2-13b, outperforming confidence-score and embedding-based MIA baselines across LLaMA-2, Pythia, and OPT model families.
AttenMIA
Novel technique introduced
Large Language Models (LLMs) are increasingly deployed to enable or improve a multitude of real-world applications. Given the large size of their training data sets, their tendency to memorize training data raises serious privacy and intellectual property concerns. A key threat is the membership inference attack (MIA), which aims to determine whether a given sample was included in the model's training set. Existing MIAs for LLMs rely primarily on output confidence scores or embedding-based features, but these signals are often brittle, leading to limited attack success. We introduce AttenMIA, a new MIA framework that exploits self-attention patterns inside the transformer model to infer membership. Attention controls the information flow within the transformer, and memorization leaves distinctive attention patterns that can be used to identify members of the training set. Our method draws information from attention heads across layers and combines it with perturbation-based divergence metrics to train an effective MIA classifier. Through extensive experiments on open-source models including LLaMA-2, Pythia, and OPT, we show that attention-based features consistently outperform baselines, particularly under the important low-false-positive regime (e.g., achieving up to 0.996 ROC AUC and 87.9% TPR@1%FPR on the WikiMIA-32 benchmark with LLaMA-2-13b). We show that attention signals generalize across datasets and architectures, and provide a layer- and head-level analysis of where membership leakage is most pronounced. We also show that using AttenMIA to replace other membership inference attacks in a data extraction framework yields training data extraction attacks that outperform the state of the art. Our findings reveal that attention mechanisms, often studied as a window into model interpretability, can inadvertently amplify privacy risks in LLMs, underscoring the need for new defenses.
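The pipeline the abstract describes — per-head attention features across layers, combined with a perturbation-based divergence, fed to an MIA classifier — can be sketched in a few lines of NumPy. The specific features below (per-head attention entropy and a KL divergence between original and perturbed attention maps), the function names, and the synthetic attention tensors are illustrative assumptions, not the paper's exact feature set:

```python
import numpy as np

def attention_entropy(attn):
    """Mean row entropy per attention head.

    attn: array of shape (heads, seq, seq) with row-stochastic
    attention weights; returns one scalar feature per head.
    """
    p = np.clip(attn, 1e-12, 1.0)
    ent = -(p * np.log(p)).sum(axis=-1)   # (heads, seq)
    return ent.mean(axis=-1)              # (heads,)

def kl_divergence(p, q):
    """Mean KL(p || q) over all attention rows of a layer."""
    p = np.clip(p, 1e-12, 1.0)
    q = np.clip(q, 1e-12, 1.0)
    return (p * np.log(p / q)).sum(axis=-1).mean()

def extract_features(attn_layers, attn_layers_perturbed):
    """Concatenate per-head entropies and a per-layer
    original-vs-perturbed divergence into one feature vector."""
    feats = []
    for a, ap in zip(attn_layers, attn_layers_perturbed):
        feats.extend(attention_entropy(a))
        feats.append(kl_divergence(a, ap))
    return np.array(feats)

# Synthetic stand-in for attention maps: 4 layers, 8 heads, seq len 16.
rng = np.random.default_rng(0)

def rand_attn(layers, heads, seq):
    logits = rng.normal(size=(layers, heads, seq, seq))
    e = np.exp(logits)
    return e / e.sum(axis=-1, keepdims=True)  # softmax rows

attn = rand_attn(4, 8, 16)            # attention for the original input
attn_pert = rand_attn(4, 8, 16)       # attention for a perturbed input
feats = extract_features(attn, attn_pert)
print(feats.shape)                    # 4 layers x (8 heads + 1 KL) = (36,)
```

In a real attack the attention tensors would come from the target model (e.g., `output_attentions=True` in Hugging Face `transformers`), and feature vectors for known member/non-member samples would train a standard binary classifier such as logistic regression.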
Key Contributions
- AttenMIA: a membership inference attack framework leveraging self-attention patterns across transformer layers and heads, combined with perturbation-based divergence metrics, to train an MIA classifier for LLMs
- Empirical demonstration that attention-based features consistently outperform confidence- and embedding-based MIA baselines, achieving up to 0.996 ROC AUC and 87.9% TPR@1%FPR on WikiMIA-32 with LLaMA-2-13b
- Layer- and head-level analysis of membership leakage localization, and demonstration that replacing MIA components in a data extraction pipeline with AttenMIA surpasses state-of-the-art extraction attacks
🛡️ Threat Analysis
AttenMIA's primary contribution is a new membership inference attack that determines whether a specific sample was in the LLM's training set — the binary classification task that defines ML04. Attention head features and perturbation-based divergence metrics are used to train a MIA classifier, directly targeting training data membership privacy.