
AttenMIA: LLM Membership Inference Attack through Attention Signals

Pedram Zaree 1, Md Abdullah Al Mamun 1, Yue Dong 1, Ihsen Alouani 2, Nael Abu-Ghazaleh 1

0 citations · 53 references · arXiv


Published on arXiv: 2601.18110

Membership Inference Attack

OWASP ML Top 10 — ML04

Sensitive Information Disclosure

OWASP LLM Top 10 — LLM06

Key Finding

AttenMIA achieves 87.9% TPR@1%FPR and 0.996 ROC AUC on WikiMIA-32 with LLaMA-2-13b, outperforming confidence-score and embedding-based MIA baselines across LLaMA-2, Pythia, and OPT model families.

AttenMIA

Novel technique introduced


Large Language Models (LLMs) are increasingly deployed to enable or improve a multitude of real-world applications. Given the large size of their training datasets, their tendency to memorize training data raises serious privacy and intellectual property concerns. A key threat is the membership inference attack (MIA), which aims to determine whether a given sample was included in the model's training set. Existing MIAs for LLMs rely primarily on output confidence scores or embedding-based features, but these signals are often brittle, leading to limited attack success. We introduce AttenMIA, a new MIA framework that exploits self-attention patterns inside the transformer model to infer membership. Attention controls the information flow within the transformer, and memorized samples induce distinctive attention patterns that can be used to identify members of the training set. Our method combines signals from attention heads across layers with perturbation-based divergence metrics to train an effective MIA classifier. Through extensive experiments on open-source models including LLaMA-2, Pythia, and OPT, we show that attention-based features consistently outperform baselines, particularly under the important low-false-positive regime (e.g., achieving up to 0.996 ROC AUC and 87.9% TPR@1%FPR on the WikiMIA-32 benchmark with LLaMA-2-13b). We show that attention signals generalize across datasets and architectures, and provide a layer- and head-level analysis of where membership leakage is most pronounced. We also show that using AttenMIA to replace other membership inference attacks in a data extraction framework results in training data extraction attacks that outperform the state of the art. Our findings reveal that attention mechanisms, often studied to enhance interpretability, can inadvertently amplify privacy risks in LLMs, underscoring the need for new defenses.
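The abstract describes two feature families: per-head attention statistics gathered across layers, and divergence metrics comparing attention before and after perturbing the input. The paper does not specify the exact statistics, so the sketch below is a hypothetical instance of the idea, using attention-row entropy as the per-head feature and mean KL divergence as the perturbation metric. The function names and the entropy/KL choices are assumptions, not the authors' implementation.

```python
import numpy as np

def attention_entropy_features(attn_maps):
    """Flatten per-head attention entropies into one feature vector.

    attn_maps: list with one array per layer, each shaped
    (num_heads, seq_len, seq_len), rows summing to 1 (softmax output),
    e.g. as returned by a transformer run with attentions enabled.
    """
    feats = []
    for layer in attn_maps:
        # entropy of each attention row: (num_heads, seq_len)
        row_entropy = -(layer * np.log(layer + 1e-12)).sum(axis=-1)
        # one feature per head: entropy averaged over query positions
        feats.extend(row_entropy.mean(axis=-1))
    return np.asarray(feats)

def perturbation_divergence(attn_orig, attn_pert):
    """Mean KL divergence between attention rows of the original input
    and a perturbed copy; members are hypothesized to diverge differently."""
    kl = (attn_orig * (np.log(attn_orig + 1e-12)
                       - np.log(attn_pert + 1e-12))).sum(axis=-1)
    return float(kl.mean())
```

Concatenating the entropy vector with a few divergence values would give the feature vector on which an MIA classifier could then be trained, per the paper's high-level description.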


Key Contributions

  • AttenMIA: a membership inference attack framework leveraging self-attention patterns across transformer layers and heads, combined with perturbation-based divergence metrics, to train an MIA classifier for LLMs
  • Empirical demonstration that attention-based features consistently outperform confidence- and embedding-based MIA baselines, achieving up to 0.996 ROC AUC and 87.9% TPR@1%FPR on WikiMIA-32 with LLaMA-2-13b
  • Layer- and head-level analysis of membership leakage localization, and demonstration that replacing MIA components in a data extraction pipeline with AttenMIA surpasses state-of-the-art extraction attacks

🛡️ Threat Analysis

Membership Inference Attack

AttenMIA's primary contribution is a new membership inference attack that determines whether a specific sample was in the LLM's training set — the binary classification task that defines ML04. Attention-head features and perturbation-based divergence metrics are used to train an MIA classifier, directly targeting training-data membership privacy.
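The headline metric, TPR@1%FPR, measures how many members the classifier catches while flagging at most 1% of non-members — the regime where MIAs are actually dangerous. A minimal sketch of computing it from classifier scores, on synthetic data (the score distributions below are illustrative, not the paper's):

```python
import numpy as np

def tpr_at_fpr(scores, labels, target_fpr=0.01):
    """True-positive rate at a fixed false-positive rate.

    scores: higher = more member-like; labels: 1 = member, 0 = non-member.
    Picks the lowest threshold whose FPR stays at or below target_fpr.
    """
    neg = np.sort(scores[labels == 0])[::-1]          # non-member scores, descending
    k = int(np.floor(target_fpr * len(neg)))          # non-members we may admit
    thresh = neg[k] if k < len(neg) else -np.inf
    return float((scores[labels == 1] > thresh).mean())

# Illustrative scores from a well-separated (hypothetical) MIA classifier
rng = np.random.default_rng(0)
members = rng.normal(1.0, 0.1, 500)       # members score high
non_members = rng.normal(0.0, 0.1, 500)   # non-members score low
scores = np.concatenate([members, non_members])
labels = np.concatenate([np.ones(500), np.zeros(500)]).astype(int)
print(f"TPR@1%FPR = {tpr_at_fpr(scores, labels):.3f}")
```

With cleanly separated score distributions like these, the TPR at 1% FPR is near 1.0; the paper's 87.9% figure reflects the harder real separation between members and non-members.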


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
white_box, inference_time, targeted
Datasets
WikiMIA-32, MIMIR
Applications
large language model privacy auditing, training data membership inference