AttenMIA: LLM Membership Inference Attack through Attention Signals
Pedram Zaree 1, Md Abdullah Al Mamun 1, Yue Dong 1, Ihsen Alouani 2, Nael Abu-Ghazaleh 1
Published on arXiv
2601.18110
Membership Inference Attack
OWASP ML Top 10 — ML04
Sensitive Information Disclosure
OWASP LLM Top 10 — LLM06
Key Finding
AttenMIA achieves 87.9% TPR@1%FPR and 0.996 ROC AUC on WikiMIA-32 with LLaMA-2-13b, outperforming confidence-score and embedding-based MIA baselines across LLaMA-2, Pythia, and OPT model families.
AttenMIA
Novel technique introduced
Large Language Models (LLMs) are increasingly deployed to enable or improve a multitude of real-world applications. Given the large size of their training data sets, their tendency to memorize training data raises serious privacy and intellectual property concerns. A key threat is the membership inference attack (MIA), which aims to determine whether a given sample was included in the model's training set. Existing MIAs for LLMs rely primarily on output confidence scores or embedding-based features, but these signals are often brittle, leading to limited attack success. We introduce AttenMIA, a new MIA framework that exploits self-attention patterns inside the transformer model to infer membership. Attention controls the information flow within the transformer, and memorization leaves distinctive attention patterns that can be used to identify members of the training set. Our method draws information from attention heads across layers and combines it with perturbation-based divergence metrics to train an effective MIA classifier. Through extensive experiments on open-source models including LLaMA-2, Pythia, and OPT, we show that attention-based features consistently outperform baselines, particularly under the important low-false-positive regime (e.g., achieving up to 0.996 ROC AUC and 87.9% TPR@1%FPR on the WikiMIA-32 benchmark with LLaMA-2-13b). We show that attention signals generalize across datasets and architectures, and provide a layer- and head-level analysis of where membership leakage is most pronounced. We also show that using AttenMIA to replace other membership inference attacks in a data extraction framework yields training data extraction attacks that outperform the state of the art. Our findings reveal that attention mechanisms, often studied as a window into model interpretability, can inadvertently amplify privacy risks in LLMs, underscoring the need for new defenses.
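The pipeline the abstract describes — per-head attention features across layers, combined with a perturbation-based divergence, fed to an MIA classifier — can be sketched in a few lines of NumPy. The specific features below (per-head attention entropy and a KL divergence between original and perturbed attention maps), the function names, and the synthetic attention tensors are illustrative assumptions, not the paper's exact feature set:

```python
import numpy as np

def attention_entropy(attn):
    """Mean row entropy per attention head.

    attn: array of shape (heads, seq, seq) with row-stochastic
    attention weights; returns one scalar feature per head.
    """
    p = np.clip(attn, 1e-12, 1.0)
    ent = -(p * np.log(p)).sum(axis=-1)   # (heads, seq)
    return ent.mean(axis=-1)              # (heads,)

def kl_divergence(p, q):
    """Mean KL(p || q) over all attention rows of a layer."""
    p = np.clip(p, 1e-12, 1.0)
    q = np.clip(q, 1e-12, 1.0)
    return (p * np.log(p / q)).sum(axis=-1).mean()

def extract_features(attn_layers, attn_layers_perturbed):
    """Concatenate per-head entropies and a per-layer
    original-vs-perturbed divergence into one feature vector."""
    feats = []
    for a, ap in zip(attn_layers, attn_layers_perturbed):
        feats.extend(attention_entropy(a))
        feats.append(kl_divergence(a, ap))
    return np.array(feats)

# Synthetic stand-in for attention maps: 4 layers, 8 heads, seq len 16.
rng = np.random.default_rng(0)

def rand_attn(layers, heads, seq):
    logits = rng.normal(size=(layers, heads, seq, seq))
    e = np.exp(logits)
    return e / e.sum(axis=-1, keepdims=True)  # softmax rows

attn = rand_attn(4, 8, 16)            # attention for the original input
attn_pert = rand_attn(4, 8, 16)       # attention for a perturbed input
feats = extract_features(attn, attn_pert)
print(feats.shape)                    # 4 layers x (8 heads + 1 KL) = (36,)
```

In a real attack the attention tensors would come from the target model (e.g., `output_attentions=True` in Hugging Face `transformers`), and feature vectors for known member/non-member samples would train a standard binary classifier such as logistic regression.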
Key Contributions
- AttenMIA: a membership inference attack framework leveraging self-attention patterns across transformer layers and heads, combined with perturbation-based divergence metrics, to train an MIA classifier for LLMs
- Empirical demonstration that attention-based features consistently outperform confidence- and embedding-based MIA baselines, achieving up to 0.996 ROC AUC and 87.9% TPR@1%FPR on WikiMIA-32 with LLaMA-2-13b
- Layer- and head-level analysis of membership leakage localization, and demonstration that replacing MIA components in a data extraction pipeline with AttenMIA surpasses state-of-the-art extraction attacks
🛡️ Threat Analysis
AttenMIA's primary contribution is a new membership inference attack that determines whether a specific sample was in the LLM's training set — the binary classification task that defines ML04. Attention head features and perturbation-based divergence metrics are used to train a MIA classifier, directly targeting training data membership privacy.