
What Hard Tokens Reveal: Exploiting Low-confidence Tokens for Membership Inference Attacks against Large Language Models

Md Tasnim Jawad 1, Mingyan Xiao 2, Yanzhao Wu 1

0 citations · 58 references · arXiv


Published on arXiv · 2601.20885

Membership Inference Attack

OWASP ML Top 10 — ML04

Key Finding

HT-MIA consistently outperforms seven state-of-the-art MIA baselines across four datasets by exploiting token-level probability improvements at hard tokens to surface stronger membership signals than sequence-level aggregation methods.

HT-MIA

Novel technique introduced


With the widespread adoption of Large Language Models (LLMs) and increasingly stringent privacy regulations, protecting data privacy in LLMs has become essential, especially for privacy-sensitive applications. Membership Inference Attacks (MIAs) attempt to determine whether a specific data sample was included in a model's training or fine-tuning dataset, posing serious privacy risks. However, most existing MIA techniques against LLMs rely on sequence-level aggregated prediction statistics, which cannot distinguish prediction improvements caused by generalization from those caused by memorization, leading to low attack effectiveness. To address this limitation, we propose HT-MIA, a novel membership inference approach that captures token-level probabilities at low-confidence (hard) tokens, where membership signals are more pronounced. By comparing token-level probability improvements at hard tokens between a fine-tuned target model and a pre-trained reference model, HT-MIA isolates strong, robust membership signals that sequence-level aggregation obscures. Extensive experiments on both domain-specific medical datasets and general-purpose benchmarks demonstrate that HT-MIA consistently outperforms seven state-of-the-art MIA baselines. We further investigate differentially private training as an effective defense against MIAs in LLMs. Overall, our HT-MIA framework establishes hard-token analysis as a foundation for advancing membership inference attacks and defenses for LLMs.
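The scoring idea in the abstract can be sketched in a few lines: select tokens to which the pre-trained reference model assigns low probability, then measure how much the fine-tuned target model improves on exactly those tokens. This is a minimal illustration under stated assumptions, not the paper's exact formulation: the function name, the hardness threshold `tau`, and the mean-improvement aggregation are all placeholders chosen for clarity.

```python
import math

def hard_token_score(target_logprobs, reference_logprobs, tau=0.1):
    """Illustrative hard-token membership score (not the paper's exact method).

    A token is treated as 'hard' when the reference model assigns it
    probability below tau; the score is the mean probability improvement
    of the target model over the reference model at those tokens.
    """
    improvements = []
    for lp_target, lp_ref in zip(target_logprobs, reference_logprobs):
        if math.exp(lp_ref) < tau:  # low-confidence (hard) token
            improvements.append(math.exp(lp_target) - math.exp(lp_ref))
    if not improvements:
        return 0.0  # no hard tokens in this sequence
    return sum(improvements) / len(improvements)
```

A sequence the target model memorized during fine-tuning should show large probability gains precisely at hard tokens, while gains from generalization tend to be smaller and spread across easy tokens as well, which is why restricting the comparison to hard tokens sharpens the membership signal.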


Key Contributions

  • Token-level membership inference approach (HT-MIA) that isolates membership signals at low-confidence (hard) tokens by comparing probability improvements between a fine-tuned target model and a pre-trained reference model
  • Consistent outperformance of seven state-of-the-art MIA baselines across four datasets spanning medical and general domains
  • Investigation of differentially private training as an effective defense mechanism against MIAs on fine-tuned LLMs
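The defense in the last bullet, differentially private training, is commonly realized as DP-SGD: clip each per-example gradient to a fixed norm, sum, and add Gaussian noise before averaging. The following is a minimal, library-free sketch of one such step; the function name, parameter names, and plain-list gradient representation are illustrative assumptions, not the paper's implementation.

```python
import random

def dp_sgd_step(per_example_grads, clip_norm=1.0, noise_multiplier=1.0):
    """Illustrative single DP-SGD aggregation step.

    Each per-example gradient (a list of floats) is clipped to L2 norm
    clip_norm, the clipped gradients are summed, Gaussian noise with
    standard deviation noise_multiplier * clip_norm is added per
    coordinate, and the result is averaged over the batch.
    """
    clipped = []
    for g in per_example_grads:
        norm = sum(x * x for x in g) ** 0.5
        scale = min(1.0, clip_norm / max(norm, 1e-12))  # clip to clip_norm
        clipped.append([x * scale for x in g])
    summed = [sum(col) for col in zip(*clipped)]
    sigma = noise_multiplier * clip_norm
    noisy = [s + random.gauss(0.0, sigma) for s in summed]
    n = len(per_example_grads)
    return [x / n for x in noisy]
```

Clipping bounds any single example's influence on the update, and the calibrated noise masks what remains, which is what limits the probability improvements at memorized hard tokens that HT-MIA exploits.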

🛡️ Threat Analysis

Membership Inference Attack

HT-MIA is a membership inference attack that determines whether specific sequences appeared in an LLM's fine-tuning dataset by exploiting token-level probability improvements at low-confidence (hard) tokens. It is a direct, novel MIA contribution evaluated against seven state-of-the-art baselines.


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
black_box, inference_time
Datasets
domain-specific medical datasets, general-purpose NLP benchmarks
Applications
llm fine-tuning, medical language models, privacy-sensitive nlp applications