Hierarchical Refinement of Universal Multimodal Attacks on Vision-Language Models

Peng-Fei Zhang, Zi Huang

0 citations · 54 references · arXiv

Published on arXiv · 2601.10313

Input Manipulation Attack

OWASP ML Top 10 — ML01

Key Finding

HRA achieves superior cross-model and cross-task transferability compared to sample-specific and prior universal VLP attack methods across multiple downstream tasks and datasets.

Hierarchical Refinement Attack (HRA)

Novel technique introduced


Existing adversarial attacks for VLP models are mostly sample-specific, resulting in substantial computational overhead when scaled to large datasets or new scenarios. To overcome this limitation, we propose Hierarchical Refinement Attack (HRA), a multimodal universal attack framework for VLP models. For the image modality, we refine the optimization path by leveraging a temporal hierarchy of historical and estimated future gradients to avoid local minima and stabilize universal perturbation learning. For the text modality, we hierarchically model textual importance by considering both intra- and inter-sentence contributions to identify globally influential words, which are then used as universal text perturbations. Extensive experiments across various downstream tasks, VLP models, and datasets demonstrate the superior transferability of the proposed universal multimodal attacks.
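The image-side idea can be sketched as a momentum update that blends the historical direction with a gradient estimated at a "future" point reached by following that momentum. This is a minimal illustrative sketch, not the paper's exact formulation: the lookahead step, the decay `mu`, the equal weighting of the two gradients, and the L-infinity projection with bound `eps` are all assumptions.

```python
def future_aware_momentum_step(delta, grad_fn, momentum, mu=0.9, lr=0.01, eps=0.1):
    """One update of a universal perturbation delta (a flat list of floats).

    grad_fn(delta) returns the attack-loss gradient w.r.t. delta, averaged
    over a batch of samples (the perturbation is shared across inputs).
    Illustrative sketch only; names and weights are assumptions.
    """
    # Estimated future point: where the current momentum would take us.
    lookahead = [d + lr * m for d, m in zip(delta, momentum)]
    g_hist = grad_fn(delta)        # gradient at the current point
    g_future = grad_fn(lookahead)  # estimated future gradient
    # Temporal hierarchy: decayed historical momentum plus a blend of the
    # current and estimated future gradients.
    momentum = [mu * m + 0.5 * (gh + gf)
                for m, gh, gf in zip(momentum, g_hist, g_future)]
    # Gradient-ascent step on the attack loss, projected into the L_inf ball.
    delta = [max(-eps, min(eps, d + lr * m)) for d, m in zip(delta, momentum)]
    return delta, momentum
```

The lookahead gradient acts like a Nesterov-style correction: when the momentum direction is about to overshoot, the future gradient pulls the update back early, which is one way to read "avoiding local minima and stabilizing universal perturbation learning".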


Key Contributions

  • Future-aware momentum for universal image perturbation learning that incorporates both historical and predicted future gradients to escape local optima and improve transferability.
  • Hierarchical textual importance modeling (intra- and inter-sentence ranking) to identify universally influential words as task-agnostic text perturbations without requiring a predefined word library.
  • A unified multimodal universal attack framework (HRA) that outperforms sample-specific and prior universal methods in cross-model and cross-task transferability.
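The text-side contribution can be sketched as a two-level ranking: score each word inside its own sentence, then aggregate those scores across the corpus so that words influential in many sentences rank highest globally. This is a hedged sketch; the actual intra-sentence scoring model is the paper's, so it is injected here as a callable, and summation as the inter-sentence aggregation rule is an assumption.

```python
from collections import defaultdict

def rank_global_words(sentences, importance_fn, top_k=3):
    """Rank globally influential words across a corpus of sentences.

    importance_fn(words, i) -> float gives the intra-sentence contribution
    of word i (e.g. the attack-loss change when that word is masked).
    Illustrative sketch; aggregation by summation is an assumption.
    """
    scores = defaultdict(float)
    for sent in sentences:
        words = sent.split()
        for i, _ in enumerate(words):
            # Intra-sentence level: contribution of this word in this sentence.
            scores[words[i]] += importance_fn(words, i)
    # Inter-sentence level: summing rewards words that are both individually
    # influential and recurrent across sentences.
    global_rank = sorted(scores, key=scores.get, reverse=True)
    return global_rank[:top_k]
```

Because the ranking emerges from the corpus itself, no predefined word library is needed, matching the second contribution above.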

🛡️ Threat Analysis

Input Manipulation Attack

The paper's core contribution is crafting universal adversarial perturbations (UAPs) applied to both image and text modalities at inference time to disrupt cross-modal alignment in VLP models — a direct input manipulation attack using gradient-based optimization for images and importance-ranked word substitution for text.
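Once optimized, the attack needs no further gradient access at inference: the same perturbation pair is applied to any input. A minimal sketch, assuming images as flat float lists in [0, 1], a `delta` already bounded during optimization, and trigger words prepended to the caption (the insertion position is an assumption):

```python
def apply_universal_attack(image, caption, delta, trigger_words):
    """Apply a pre-computed universal attack to an arbitrary image-text pair.

    image: floats in [0, 1]; delta: same-length universal perturbation,
    assumed already bounded; trigger_words: universal text perturbation.
    Illustrative sketch only.
    """
    # Image modality: add the shared perturbation, keep pixels valid.
    adv_image = [min(1.0, max(0.0, p + d)) for p, d in zip(image, delta)]
    # Text modality: splice in the globally influential words (prepended
    # here as an assumption).
    adv_caption = " ".join(trigger_words) + " " + caption
    return adv_image, adv_caption
```

This is what makes the attack "universal" in the ML01 sense: the per-sample cost collapses to an addition and a string splice, so scaling to large datasets is free after the one-time optimization.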


Details

Domains
vision · nlp · multimodal
Model Types
vlm · transformer
Threat Tags
white_box · black_box · inference_time · untargeted · digital
Applications
image-text retrieval · visual question answering · image captioning · vision-language pre-training