Lost in Modality: Evaluating the Effectiveness of Text-Based Membership Inference Attacks on Large Multimodal Models
Ziyi Tong, Feifei Sun, Le Minh Nguyen
Published on arXiv (arXiv:2512.03121)
Membership Inference Attack
OWASP ML Top 10 — ML04
Key Finding
In-distribution logit-based MIAs achieve comparable performance under V+T and T-only conditions, but in OOD settings visual inputs mask membership signals, substantially degrading attack effectiveness.
Multimodal Large Language Models (MLLMs) are emerging as foundational tools in an expanding range of applications. Consequently, understanding training-data leakage in these systems is increasingly critical. Log-probability-based membership inference attacks (MIAs) have become a widely adopted approach for assessing data exposure in large language models (LLMs), yet their effectiveness in MLLMs remains unclear. We present the first comprehensive evaluation of extending these text-based MIA methods to multimodal settings. Our experiments under vision-and-text (V+T) and text-only (T-only) conditions across the DeepSeek-VL and InternVL model families show that in in-distribution settings, logit-based MIAs perform comparably across configurations, with a slight V+T advantage. Conversely, in out-of-distribution settings, visual inputs act as regularizers, effectively masking membership signals.
Key Contributions
- First comprehensive evaluation of logit-based text MIA methods (Loss, Min-K, Min-K%, Recall) applied to multimodal MLLM settings (V+T vs. T-only)
- Demonstrates that visual inputs act as regularizers in OOD settings, effectively suppressing membership signals and reducing MIA effectiveness
- Shows that MIA effectiveness is highly model-dependent, driven by differing vision-text fusion architectures across DeepSeek-VL and InternVL families
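To make the attack family concrete, the scoring rules behind two of the evaluated methods can be sketched as follows. This is a minimal illustration using made-up per-token log-probabilities, not the paper's implementation; in a real grey-box attack these values would come from the target model's output logits for a candidate VQA sample.

```python
def loss_score(token_logprobs):
    """Loss attack: average negative log-likelihood of the sequence.
    Lower loss suggests the sample is more likely a training member."""
    return -sum(token_logprobs) / len(token_logprobs)

def min_k_percent_score(token_logprobs, k=0.2):
    """Min-K% attack: average log-probability over the k fraction of
    tokens the model is least confident about. Higher (less negative)
    scores suggest membership."""
    n = max(1, int(len(token_logprobs) * k))
    lowest = sorted(token_logprobs)[:n]
    return sum(lowest) / n

# Hypothetical per-token log-probabilities for two candidate samples
member_like = [-0.1, -0.2, -0.15, -0.3, -0.25]     # model is confident
nonmember_like = [-1.2, -2.5, -0.9, -3.1, -1.7]    # model is uncertain

print(loss_score(member_like), loss_score(nonmember_like))
print(min_k_percent_score(member_like), min_k_percent_score(nonmember_like))
```

Thresholding either score then yields the member/non-member decision; the paper's finding is that visual inputs in OOD settings flatten exactly these score gaps.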
🛡️ Threat Analysis
The paper's primary contribution is a comprehensive evaluation of logit-based membership inference attacks on MLLMs: determining, under grey-box access, whether specific VQA samples appeared in the training data, across V+T and T-only input configurations.