
Adversarial Attacks on VQA-NLE: Exposing and Alleviating Inconsistencies in Visual Question Answering Explanations

Yahsin Yeh, Yilun Wu, Bokai Ruan, Honghan Shuai


Published on arXiv: 2508.12430

Input Manipulation Attack (OWASP ML Top 10 — ML01)

Prompt Injection (OWASP LLM Top 10 — LLM01)

Key Finding

Adversarial image and question perturbations successfully induce contradictory explanations in VQA-NLE models across two benchmarks and two widely used models, while knowledge-based defenses show potential for alleviating these inconsistencies.


Natural language explanations in visual question answering (VQA-NLE) aim to make black-box models more transparent by elucidating their decision-making processes. However, we find that existing VQA-NLE systems can produce inconsistent explanations and reach conclusions without genuinely understanding the underlying context, exposing weaknesses in either their inference pipeline or explanation-generation mechanism. To highlight these vulnerabilities, we not only leverage an existing adversarial strategy to perturb questions but also propose a novel strategy that minimally alters images to induce contradictory or spurious outputs. We further introduce a mitigation method that leverages external knowledge to alleviate these inconsistencies, thereby bolstering model robustness. Extensive evaluations on two standard benchmarks and two widely used VQA-NLE models underscore the effectiveness of our attacks and the potential of knowledge-based defenses, ultimately revealing pressing security and reliability concerns in current VQA-NLE systems.


Key Contributions

  • Novel adversarial image perturbation strategy that minimally alters images to induce contradictory or spurious outputs in VQA-NLE models
  • Application of existing adversarial question-perturbation strategies to VQA-NLE, establishing a dual-modality attack surface
  • Knowledge-based mitigation method leveraging external knowledge to alleviate adversarial inconsistencies and bolster VQA-NLE robustness
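The first contribution hinges on keeping the image perturbation minimal while still flipping the model's output. The paper does not publish its algorithm here, so the following is only a hedged sketch of the general idea using an FGSM-style, epsilon-bounded step against a toy linear stand-in for the model; the weights, inputs, and scoring function are all hypothetical illustrations, not the authors' method.

```python
import numpy as np

# Toy stand-in for a VQA model's answer score: a linear function over a
# flattened "image". Hypothetical illustration only, NOT the paper's model.
rng = np.random.default_rng(0)
w = rng.normal(size=64)   # hypothetical model weights
x = rng.normal(size=64)   # hypothetical flattened image

def score(img):
    """Positive score -> one answer, negative -> the contradictory one."""
    return float(w @ img)

# FGSM-style step: shift each pixel by at most epsilon in the direction
# that pushes the score toward the opposite answer. For this linear toy,
# the gradient of the score with respect to the input is exactly w.
epsilon = 0.5
grad = w
x_adv = x - np.sign(score(x)) * epsilon * np.sign(grad)

# The perturbation stays within the small, hard budget epsilon...
assert np.max(np.abs(x_adv - x)) <= epsilon + 1e-9
# ...yet the signed score provably moves toward the contradictory answer.
assert score(x_adv) * np.sign(score(x)) < abs(score(x))
```

A real attack on a VQA-NLE model would backpropagate through a vision-language network rather than a linear map, and would target the explanation text as well as the answer, but the budget-constrained signed-gradient step above is the standard template such image perturbations build on.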

🛡️ Threat Analysis

Input Manipulation Attack

The paper's core contribution is a novel adversarial image-perturbation strategy that minimally alters visual inputs to induce contradictory or spurious outputs from VQA-NLE models. Combined with an existing adversarial question-perturbation technique, it forms a dual-modality input manipulation attack targeting inference-time behavior.
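On the defense side, the paper's mitigation leverages external knowledge to catch inconsistent outputs. The exact mechanism is not detailed in this summary, so here is only a hedged sketch of the general pattern, with a hypothetical toy knowledge base and a made-up `consistent` helper: retrieve facts about entities mentioned in the explanation and flag answers that contradict them.

```python
# Hypothetical external knowledge base -- an illustration, not the
# paper's actual knowledge source or retrieval method.
KNOWLEDGE = {
    "banana": {"color": "yellow"},
    "sky": {"color": "blue"},
}

def consistent(answer, explanation_entities, attribute="color"):
    """Return False if any retrieved fact contradicts the model's answer.

    Entities absent from the knowledge base are skipped rather than
    treated as contradictions.
    """
    for entity in explanation_entities:
        fact = KNOWLEDGE.get(entity, {}).get(attribute)
        if fact is not None and fact != answer:
            return False
    return True

print(consistent("yellow", ["banana"]))  # answer agrees with knowledge
print(consistent("green", ["banana"]))   # answer contradicts retrieved fact
```

In a full system the lookup would be replaced by retrieval from a large knowledge source and the string match by a semantic comparison, but the structure, grounding the answer-explanation pair against facts the model cannot perturb, is what lets such a check alleviate adversarially induced inconsistencies.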


Details

Domains: vision, nlp, multimodal
Model Types: vlm, transformer
Threat Tags: white_box, inference_time, targeted, digital
Applications: visual question answering, natural language explanation generation, multimodal AI transparency systems