A Two-Stage Globally-Diverse Adversarial Attack for Vision-Language Pre-training Models
Wutao Chen 1, Huaqin Zou 1, Chen Wan 1, Lifeng Huang 1,2
Published on arXiv: 2601.12304
Input Manipulation Attack
OWASP ML Top 10 — ML01
Key Finding
2S-GDA improves attack success rates by up to 11.17% over state-of-the-art methods in black-box settings on VLP models including CLIP and ALBEF architectures.
2S-GDA
Novel technique introduced
Vision-language pre-training (VLP) models are vulnerable to adversarial examples, particularly in black-box scenarios. Existing multimodal attacks often suffer from limited perturbation diversity and unstable multi-stage pipelines. To address these challenges, we propose 2S-GDA, a two-stage globally-diverse attack framework. The proposed method first introduces textual perturbations through a globally-diverse strategy that combines candidate text expansion with globally-aware replacement. To enhance visual diversity, image-level perturbations are generated using multi-scale resizing and block-shuffle rotation. Extensive experiments on VLP models demonstrate that 2S-GDA consistently improves attack success rates over state-of-the-art methods, with gains of up to 11.17% in black-box settings. Our framework is modular and can be easily combined with existing methods to further enhance adversarial transferability.
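The visual-diversity stage described above can be sketched as two simple input transforms. The following is a minimal illustrative implementation, not the authors' code: block counts, scale factors, and the nearest-neighbour resize are assumptions, and a real attack would apply these transforms to the image before each gradient step on a surrogate VLP model.

```python
import numpy as np

def block_shuffle_rotate(image, blocks=4, rng=None):
    """Block-shuffle rotation (BSR) sketch: cut a square (H, W, C) image
    into a blocks x blocks grid, shuffle the tiles, and rotate each tile
    by a random multiple of 90 degrees. H and W must be equal and
    divisible by `blocks` so rotated tiles keep their shape."""
    rng = np.random.default_rng() if rng is None else rng
    h, w = image.shape[0] // blocks, image.shape[1] // blocks
    tiles = [image[i * h:(i + 1) * h, j * w:(j + 1) * w]
             for i in range(blocks) for j in range(blocks)]
    rng.shuffle(tiles)                      # random tile permutation
    tiles = [np.rot90(t, k=int(rng.integers(4))) for t in tiles]
    rows = [np.concatenate(tiles[r * blocks:(r + 1) * blocks], axis=1)
            for r in range(blocks)]
    return np.concatenate(rows, axis=0)

def multi_scale_copies(image, scales=(0.75, 0.9, 1.1)):
    """Multi-scale resizing sketch: return nearest-neighbour resized
    copies of the image at several scales (scale set is an assumption)."""
    copies = []
    for s in scales:
        h = max(1, int(image.shape[0] * s))
        w = max(1, int(image.shape[1] * s))
        ys = (np.arange(h) * image.shape[0] / h).astype(int)
        xs = (np.arange(w) * image.shape[1] / w).astype(int)
        copies.append(image[ys][:, xs])
    return copies
```

Because both transforms are label-preserving for the victim model but change the input statistics, averaging surrogate gradients over the transformed copies tends to improve black-box transferability.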
Key Contributions
- Two-stage attack framework (2S-GDA) that stabilizes multimodal adversarial pipelines by removing the unstable final text re-perturbation stage used in prior work (e.g., SGA)
- Globally-diverse textual perturbation strategy that expands per-word candidate sets with a BERT masked-language-model and selects replacements with a globally-aware scoring step, increasing semantic diversity
- Visual perturbation enhancement via multi-scale resizing and block-shuffle rotation (BSR), yielding up to 11.17% higher attack success rate over SOTA in black-box settings
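The globally-aware replacement idea in the contributions above can be illustrated with a toy sketch. Everything here is an assumption for illustration: the hashed bag-of-words `embed` stands in for a VLP text encoder, and the per-word candidate sets stand in for BERT-MLM expansions; only the selection rule (score each candidate in full-sentence context and keep the one least aligned with the image embedding) reflects the described strategy.

```python
import numpy as np

def embed(text):
    """Toy stand-in for a sentence encoder: hashed bag-of-words,
    L2-normalized. A real attack would use the surrogate VLP text
    encoder here."""
    v = np.zeros(64)
    for tok in text.lower().split():
        v[hash(tok) % 64] += 1.0
    n = np.linalg.norm(v)
    return v / n if n else v

def globally_aware_replace(sentence, candidates, image_emb):
    """For each word with candidate substitutes, re-embed the WHOLE
    sentence per candidate and keep the substitute that minimizes
    alignment with the image embedding (lower cosine = stronger
    cross-modal attack), rather than scoring words in isolation."""
    words = sentence.split()
    for i, w in enumerate(words):
        best, best_sim = w, float(np.dot(embed(" ".join(words)), image_emb))
        for cand in candidates.get(w, []):
            trial = words[:i] + [cand] + words[i + 1:]
            sim = float(np.dot(embed(" ".join(trial)), image_emb))
            if sim < best_sim:
                best, best_sim = cand, sim
        words[i] = best
    return " ".join(words)
```

Scoring each candidate in global sentence context is what distinguishes this step from greedy per-word substitution: a replacement is kept only if the full perturbed caption drifts further from the image in embedding space.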
🛡️ Threat Analysis
Proposes adversarial perturbations (text-level word substitutions plus image-level multi-scale and block-shuffle-rotation transformations) that manipulate VLP model inputs at inference time, causing cross-modal semantic alignment to fail — the core definition of an input manipulation attack.