A Two-Stage Globally-Diverse Adversarial Attack for Vision-Language Pre-training Models
Wutao Chen 1, Huaqin Zou 1, Chen Wan 1, Lifeng Huang 1,2
Published on arXiv: 2601.12304
Input Manipulation Attack
OWASP ML Top 10 — ML01
Key Finding
2S-GDA improves attack success rates by up to 11.17% over state-of-the-art methods in black-box settings on VLP models including CLIP and ALBEF architectures.
2S-GDA
Novel technique introduced
Vision-language pre-training (VLP) models are vulnerable to adversarial examples, particularly in black-box scenarios. Existing multimodal attacks often suffer from limited perturbation diversity and unstable multi-stage pipelines. To address these challenges, we propose 2S-GDA, a two-stage globally-diverse attack framework. The proposed method first introduces textual perturbations through a globally-diverse strategy that combines candidate text expansion with globally-aware replacement. To enhance visual diversity, image-level perturbations are generated using multi-scale resizing and block-shuffle rotation. Extensive experiments on VLP models demonstrate that 2S-GDA consistently improves attack success rates over state-of-the-art methods, with gains of up to 11.17% in black-box settings. Our framework is modular and can be easily combined with existing methods to further enhance adversarial transferability.
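The visual-diversity stage described above can be sketched as two simple input transforms. The following is a minimal illustrative implementation, not the authors' code: block counts, scale factors, and the nearest-neighbour resize are assumptions, and a real attack would apply these transforms to the image before each gradient step on a surrogate VLP model.

```python
import numpy as np

def block_shuffle_rotate(image, blocks=4, rng=None):
    """Block-shuffle rotation (BSR) sketch: cut a square (H, W, C) image
    into a blocks x blocks grid, shuffle the tiles, and rotate each tile
    by a random multiple of 90 degrees. H and W must be equal and
    divisible by `blocks` so rotated tiles keep their shape."""
    rng = np.random.default_rng() if rng is None else rng
    h, w = image.shape[0] // blocks, image.shape[1] // blocks
    tiles = [image[i * h:(i + 1) * h, j * w:(j + 1) * w]
             for i in range(blocks) for j in range(blocks)]
    rng.shuffle(tiles)                      # random tile permutation
    tiles = [np.rot90(t, k=int(rng.integers(4))) for t in tiles]
    rows = [np.concatenate(tiles[r * blocks:(r + 1) * blocks], axis=1)
            for r in range(blocks)]
    return np.concatenate(rows, axis=0)

def multi_scale_copies(image, scales=(0.75, 0.9, 1.1)):
    """Multi-scale resizing sketch: return nearest-neighbour resized
    copies of the image at several scales (scale set is an assumption)."""
    copies = []
    for s in scales:
        h = max(1, int(image.shape[0] * s))
        w = max(1, int(image.shape[1] * s))
        ys = (np.arange(h) * image.shape[0] / h).astype(int)
        xs = (np.arange(w) * image.shape[1] / w).astype(int)
        copies.append(image[ys][:, xs])
    return copies
```

Because both transforms are label-preserving for the victim model but change the input statistics, averaging surrogate gradients over the transformed copies tends to improve black-box transferability.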
Key Contributions
- Two-stage attack framework (2S-GDA) that stabilizes multimodal adversarial pipelines by removing the unstable final text re-perturbation stage used in prior work (e.g., SGA)
- Globally-diverse textual perturbation strategy that expands per-word candidate sets with a BERT masked-language-model and selects replacements with a globally-aware scoring step, increasing semantic diversity
- Visual perturbation enhancement via multi-scale resizing and block-shuffle rotation (BSR), yielding up to 11.17% higher attack success rate over SOTA in black-box settings
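The globally-aware replacement idea in the contributions above can be illustrated with a toy sketch. Everything here is an assumption for illustration: the hashed bag-of-words `embed` stands in for a VLP text encoder, and the per-word candidate sets stand in for BERT-MLM expansions; only the selection rule (score each candidate in full-sentence context and keep the one least aligned with the image embedding) reflects the described strategy.

```python
import numpy as np

def embed(text):
    """Toy stand-in for a sentence encoder: hashed bag-of-words,
    L2-normalized. A real attack would use the surrogate VLP text
    encoder here."""
    v = np.zeros(64)
    for tok in text.lower().split():
        v[hash(tok) % 64] += 1.0
    n = np.linalg.norm(v)
    return v / n if n else v

def globally_aware_replace(sentence, candidates, image_emb):
    """For each word with candidate substitutes, re-embed the WHOLE
    sentence per candidate and keep the substitute that minimizes
    alignment with the image embedding (lower cosine = stronger
    cross-modal attack), rather than scoring words in isolation."""
    words = sentence.split()
    for i, w in enumerate(words):
        best, best_sim = w, float(np.dot(embed(" ".join(words)), image_emb))
        for cand in candidates.get(w, []):
            trial = words[:i] + [cand] + words[i + 1:]
            sim = float(np.dot(embed(" ".join(trial)), image_emb))
            if sim < best_sim:
                best, best_sim = cand, sim
        words[i] = best
    return " ".join(words)
```

Scoring each candidate in global sentence context is what distinguishes this step from greedy per-word substitution: a replacement is kept only if the full perturbed caption drifts further from the image in embedding space.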
🛡️ Threat Analysis
Proposes adversarial perturbations (text-level word substitutions plus image-level multi-scale and block-shuffle-rotation transformations) that manipulate VLP model inputs at inference time, causing cross-modal semantic alignment to fail — the core definition of an input manipulation attack.