What Hard Tokens Reveal: Exploiting Low-confidence Tokens for Membership Inference Attacks against Large Language Models
Md Tasnim Jawad¹, Mingyan Xiao², Yanzhao Wu¹
Published on arXiv
2601.20885
Membership Inference Attack
OWASP ML Top 10 — ML04
Key Finding
HT-MIA consistently outperforms seven state-of-the-art MIA baselines across four datasets by exploiting token-level probability improvements at hard tokens to surface stronger membership signals than sequence-level aggregation methods.
HT-MIA
Novel technique introduced
With the widespread adoption of Large Language Models (LLMs) and increasingly stringent privacy regulations, protecting data privacy in LLMs has become essential, especially for privacy-sensitive applications. Membership Inference Attacks (MIAs) attempt to determine whether a specific data sample was included in a model's training or fine-tuning dataset, posing serious privacy risks. However, most existing MIA techniques against LLMs rely on sequence-level aggregated prediction statistics, which fail to distinguish prediction improvements caused by generalization from those caused by memorization, leading to low attack effectiveness. To address this limitation, we propose HT-MIA, a novel membership inference approach that captures token-level probabilities at low-confidence (hard) tokens, where membership signals are more pronounced. By comparing token-level probability improvements at hard tokens between a fine-tuned target model and a pre-trained reference model, HT-MIA isolates strong, robust membership signals that prior MIA approaches obscure. Extensive experiments on both domain-specific medical datasets and general-purpose benchmarks demonstrate that HT-MIA consistently outperforms seven state-of-the-art MIA baselines. We further investigate differentially private training as an effective defense mechanism against MIAs in LLMs. Overall, our HT-MIA framework establishes hard-token-based analysis as a state-of-the-art foundation for advancing membership inference attacks and defenses for LLMs.
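The core idea — score a sequence by how much the fine-tuned model improves over the reference model on tokens the reference model finds hard — can be sketched as follows. This is a minimal illustration, not the paper's exact formulation: the function name `ht_mia_score`, the threshold parameter `tau`, and the use of a simple average of log-probability improvements are all assumptions made here for clarity.

```python
import math


def ht_mia_score(target_logprobs, reference_logprobs, tau=0.5):
    """Illustrative hard-token membership score (a sketch, not the
    paper's exact formula).

    target_logprobs / reference_logprobs: per-token log-probabilities of
    the same sequence under the fine-tuned target model and the
    pre-trained reference model.
    tau: hypothetical probability threshold below which a token counts
    as "hard" for the reference model.
    """
    # Hard tokens: positions where the reference model assigns low probability.
    hard = [i for i, lp in enumerate(reference_logprobs) if math.exp(lp) < tau]
    if not hard:
        return 0.0  # no hard tokens, so no usable membership signal

    # Average log-probability improvement at hard tokens; large values
    # suggest the target model memorized this sequence during fine-tuning.
    return sum(target_logprobs[i] - reference_logprobs[i] for i in hard) / len(hard)


# Toy example: a member sequence shows large gains exactly at the hard
# tokens (reference probs 0.05 and 0.1); a non-member barely improves.
ref = [math.log(0.9), math.log(0.05), math.log(0.8), math.log(0.1)]
member = [math.log(0.92), math.log(0.6), math.log(0.85), math.log(0.7)]
nonmember = [math.log(0.9), math.log(0.07), math.log(0.82), math.log(0.12)]
```

Restricting the comparison to hard tokens is what filters out generalization: on easy tokens both models already predict well, so sequence-level averages dilute the memorization signal that concentrates at low-confidence positions.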
Key Contributions
- Token-level membership inference approach (HT-MIA) that isolates membership signals at low-confidence (hard) tokens by comparing probability improvements between a fine-tuned target model and a pre-trained reference model
- Consistent outperformance of seven state-of-the-art MIA baselines across four datasets spanning medical and general domains
- Investigation of differentially private training as an effective defense mechanism against MIAs on fine-tuned LLMs
🛡️ Threat Analysis
HT-MIA is a membership inference attack that determines whether specific sequences were in an LLM's fine-tuning dataset by exploiting token-level probability improvements at low-confidence (hard) tokens. It is a direct, novel MIA contribution evaluated against seven state-of-the-art baselines.