benchmark 2026

Do Images Speak Louder than Words? Investigating the Effect of Textual Misinformation in VLMs

0 citations · 64 references · arXiv

Published on arXiv

2601.19202

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

State-of-the-art VLMs suffer an average performance drop of over 48.2% after just one round of persuasive conflicting textual conversation, revealing a critical robustness gap.

Novel technique introduced

CONTEXT-VQA

Vision-Language Models (VLMs) have shown strong multimodal reasoning capabilities on Visual-Question-Answering (VQA) benchmarks. However, their robustness against textual misinformation remains under-explored. While existing research has studied the effect of misinformation in text-only domains, it is not clear how VLMs arbitrate between contradictory information from different modalities. To bridge the gap, we first propose the CONTEXT-VQA (i.e., Conflicting Text) dataset, consisting of image-question pairs together with systematically generated persuasive prompts that deliberately conflict with visual evidence. Then, a thorough evaluation framework is designed and executed to benchmark the susceptibility of various models to these conflicting multimodal inputs. Comprehensive experiments over 11 state-of-the-art VLMs reveal that these models are indeed vulnerable to misleading textual prompts, often overriding clear visual evidence in favor of the conflicting text, and show an average performance drop of over 48.2% after only one round of persuasive conversation. Our findings highlight a critical limitation in current VLMs and underscore the need for improved robustness against textual manipulation.
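The headline number above is a drop in performance between answering a question normally and answering it again after one round of conflicting text. The sketch below shows one way such a drop could be computed. It is not the paper's evaluation code: the `Trial` record, the exact-match scoring, and the function names are all assumptions made for illustration, and the source does not state whether the 48.2% figure is an absolute or a relative drop, so both are returned.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    """One image-question pair (hypothetical record layout, not from the paper)."""
    gold: str           # answer supported by the image
    answer_before: str  # model answer to the plain question
    answer_after: str   # model answer after one persuasive, conflicting text turn


def accuracy(predictions, golds):
    """Fraction of predictions matching the gold answer (case-insensitive exact match)."""
    if not golds:
        raise ValueError("need at least one trial")
    hits = sum(p.strip().lower() == g.strip().lower()
               for p, g in zip(predictions, golds))
    return hits / len(golds)


def performance_drop(trials):
    """Return accuracy before/after the conflicting turn and the resulting drops."""
    golds = [t.gold for t in trials]
    acc_before = accuracy([t.answer_before for t in trials], golds)
    acc_after = accuracy([t.answer_after for t in trials], golds)
    absolute = acc_before - acc_after
    # Relative drop is undefined when the model was never right to begin with.
    relative = absolute / acc_before if acc_before > 0 else 0.0
    return {
        "acc_before": acc_before,
        "acc_after": acc_after,
        "absolute_drop": absolute,
        "relative_drop": relative,
    }
```

Averaging the per-model result of `performance_drop` over all evaluated models would give a single summary figure of the kind reported here, under the assumptions stated above.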

Key Contributions

CONTEXT-VQA dataset: image-question pairs paired with systematically generated persuasive prompts that deliberately conflict with visual evidence
Evaluation framework benchmarking the susceptibility of 11 state-of-the-art VLMs to conflicting multimodal inputs
Empirical finding that VLMs frequently override clear visual evidence in favor of misleading textual prompts, with an average performance drop of over 48.2% after a single round of persuasive conversation

🛡️ Threat Analysis

Details

Domains

vision, nlp, multimodal

Model Types

vlm, multimodal

Threat Tags

black_box, inference_time

Datasets

CONTEXT-VQA

Applications

visual question answering, multimodal reasoning

Similar Papers

CSR-Bench: A Benchmark for Evaluating the Cross-modal Safety and Reliability of MLLMs

MTMCS-Bench: Evaluating Contextual Safety of Multimodal Large Language Models in Multi-Turn Dialogues

FENCE: A Financial and Multimodal Jailbreak Detection Dataset

OmniSafeBench-MM: A Unified Benchmark and Toolbox for Multimodal Jailbreak Attack-Defense Evaluation

Models as Lego Builders: Assembling Malice from Benign Blocks via Semantic Blueprints

ChartAttack: Testing the Vulnerability of LLMs to Malicious Prompting in Chart Generation

Jailbreaking Large Vision Language Models in Intelligent Transportation Systems

Red-teaming the Multimodal Reasoning: Jailbreaking Vision-Language Models via Cross-modal Entanglement Attacks