What Hard Tokens Reveal: Exploiting Low-confidence Tokens for Membership Inference Attacks against Large Language Models
Md Tasnim Jawad¹, Mingyan Xiao², Yanzhao Wu¹
Published on arXiv
2601.20885
Membership Inference Attack
OWASP ML Top 10 — ML04
Key Finding
HT-MIA consistently outperforms seven state-of-the-art MIA baselines across four datasets by exploiting token-level probability improvements at hard tokens to surface stronger membership signals than sequence-level aggregation methods.
HT-MIA
Novel technique introduced
With the widespread adoption of Large Language Models (LLMs) and increasingly stringent privacy regulations, protecting data privacy in LLMs has become essential, especially for privacy-sensitive applications. Membership Inference Attacks (MIAs) attempt to determine whether a specific data sample was included in a model's training or fine-tuning dataset, posing serious privacy risks. However, most existing MIA techniques against LLMs rely on sequence-level aggregated prediction statistics, which fail to distinguish prediction improvements caused by generalization from those caused by memorization, leading to low attack effectiveness. To address this limitation, we propose HT-MIA, a novel membership inference approach that captures token-level probabilities at low-confidence (hard) tokens, where membership signals are more pronounced. By comparing token-level probability improvements at hard tokens between a fine-tuned target model and a pre-trained reference model, HT-MIA isolates strong, robust membership signals that prior MIA approaches obscure. Extensive experiments on both domain-specific medical datasets and general-purpose benchmarks demonstrate that HT-MIA consistently outperforms seven state-of-the-art MIA baselines. We further investigate differentially private training as an effective defense mechanism against MIAs in LLMs. Overall, our HT-MIA framework establishes hard-token-based analysis as a state-of-the-art foundation for advancing membership inference attacks and defenses for LLMs.
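The core idea — score a sequence by how much the fine-tuned model improves over the reference model on tokens the reference model finds hard — can be sketched as follows. This is a minimal illustration, not the paper's exact formulation: the function name `ht_mia_score`, the threshold parameter `tau`, and the use of a simple average of log-probability improvements are all assumptions made here for clarity.

```python
import math


def ht_mia_score(target_logprobs, reference_logprobs, tau=0.5):
    """Illustrative hard-token membership score (a sketch, not the
    paper's exact formula).

    target_logprobs / reference_logprobs: per-token log-probabilities of
    the same sequence under the fine-tuned target model and the
    pre-trained reference model.
    tau: hypothetical probability threshold below which a token counts
    as "hard" for the reference model.
    """
    # Hard tokens: positions where the reference model assigns low probability.
    hard = [i for i, lp in enumerate(reference_logprobs) if math.exp(lp) < tau]
    if not hard:
        return 0.0  # no hard tokens, so no usable membership signal

    # Average log-probability improvement at hard tokens; large values
    # suggest the target model memorized this sequence during fine-tuning.
    return sum(target_logprobs[i] - reference_logprobs[i] for i in hard) / len(hard)


# Toy example: a member sequence shows large gains exactly at the hard
# tokens (reference probs 0.05 and 0.1); a non-member barely improves.
ref = [math.log(0.9), math.log(0.05), math.log(0.8), math.log(0.1)]
member = [math.log(0.92), math.log(0.6), math.log(0.85), math.log(0.7)]
nonmember = [math.log(0.9), math.log(0.07), math.log(0.82), math.log(0.12)]
```

Restricting the comparison to hard tokens is what filters out generalization: on easy tokens both models already predict well, so sequence-level averages dilute the memorization signal that concentrates at low-confidence positions.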
Key Contributions
- Token-level membership inference approach (HT-MIA) that isolates membership signals at low-confidence (hard) tokens by comparing probability improvements between a fine-tuned target model and a pre-trained reference model
- Consistent outperformance of seven state-of-the-art MIA baselines across four datasets spanning medical and general domains
- Investigation of differentially private training as an effective defense mechanism against MIAs on fine-tuned LLMs
🛡️ Threat Analysis
HT-MIA is a membership inference attack that determines whether specific sequences were in an LLM's fine-tuning dataset by exploiting token-level probability improvements at low-confidence (hard) tokens. It is a direct, novel MIA contribution evaluated against seven state-of-the-art baselines.