A Two-Stage Globally-Diverse Adversarial Attack for Vision-Language Pre-training Models

Wutao Chen 1, Huaqin Zou 1, Chen Wan 1, Lifeng Huang 1,2

Published on arXiv

2601.12304

Input Manipulation Attack

OWASP ML Top 10 — ML01

Key Finding

2S-GDA improves attack success rates by up to 11.17% over state-of-the-art methods in black-box settings on VLP models, including CLIP and ALBEF architectures.

2S-GDA

Novel technique introduced


Vision-language pre-training (VLP) models are vulnerable to adversarial examples, particularly in black-box scenarios. Existing multimodal attacks often suffer from limited perturbation diversity and unstable multi-stage pipelines. To address these challenges, we propose 2S-GDA, a two-stage globally-diverse attack framework. The method first introduces textual perturbations via a globally-diverse strategy that combines candidate text expansion with globally-aware replacement. To enhance visual diversity, image-level perturbations are then generated using multi-scale resizing and block-shuffle rotation. Extensive experiments on VLP models demonstrate that 2S-GDA consistently improves attack success rates over state-of-the-art methods, with gains of up to 11.17% in black-box settings. Our framework is modular and can be easily combined with existing methods to further enhance adversarial transferability.
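The abstract implies a simple control flow: fix the adversarial caption first, then optimize the image perturbation against it. Below is a minimal PyTorch sketch of that two-stage loop; `perturb_text`, `diverse_augment`, and `model.similarity` are assumed stand-ins for the paper's components, not its actual code.

```python
import torch

def two_stage_attack(model, image, caption, perturb_text, diverse_augment,
                     eps=8 / 255, alpha=2 / 255, steps=10):
    """Sketch of a fixed-caption, two-stage multimodal attack.

    `model` is assumed to expose a differentiable image-text similarity;
    `perturb_text` and `diverse_augment` are caller-supplied stand-ins
    for the paper's text and image diversity strategies.
    """
    # Stage 1: choose the adversarial caption once and freeze it
    # (2S-GDA reportedly drops the final text re-perturbation of prior work).
    adv_caption = perturb_text(model, image, caption)

    # Stage 2: PGD-style image perturbation against the fixed caption,
    # accumulating gradients over several diversified views per step.
    delta = torch.zeros_like(image, requires_grad=True)
    for _ in range(steps):
        loss = sum(-model.similarity(view, adv_caption)
                   for view in diverse_augment(image + delta))
        loss.backward()
        with torch.no_grad():
            delta += alpha * delta.grad.sign()  # gradient ascent on the loss
            delta.clamp_(-eps, eps)             # stay inside the L-inf ball
        delta.grad.zero_()
    return (image + delta).clamp(0, 1).detach(), adv_caption
```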


Key Contributions

  • Two-stage attack framework (2S-GDA) that stabilizes multimodal adversarial pipelines by removing the unstable final text re-perturbation stage used in prior work (e.g., SGA)
  • Globally-diverse textual perturbation strategy combining BERT-MLM candidate expansion with globally-aware replacement to increase semantic diversity (a text-stage sketch follows this list)
  • Visual perturbation enhancement via multi-scale resizing and block-shuffle rotation (BSR), yielding gains of up to 11.17% in attack success rate over SOTA methods in black-box settings (a transform sketch also follows below)
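The candidate expansion step in the second bullet can be illustrated with HuggingFace's standard fill-mask pipeline: mask each caption word and let BERT propose replacements. This is an illustrative sketch, not the authors' implementation; the selection logic that makes replacement "globally aware" is specific to the paper and only gestured at in the closing comment.

```python
from transformers import pipeline

# Standard HuggingFace fill-mask pipeline; top_k sets how many replacement
# candidates BERT proposes per masked position.
unmasker = pipeline("fill-mask", model="bert-base-uncased", top_k=10)

def expand_candidates(caption):
    """Mask each word in turn and collect BERT-MLM replacement candidates."""
    words = caption.split()
    candidates = {}
    for i, word in enumerate(words):
        masked = " ".join(words[:i] + ["[MASK]"] + words[i + 1:])
        # Each prediction is a dict with 'token_str' and 'score' keys.
        candidates[i] = [p["token_str"] for p in unmasker(masked)
                         if p["token_str"] != word.lower()]
    return candidates

# A globally-aware replacement stage would then score every candidate against
# the victim model's image-text similarity and keep the most adversarial
# substitutions; that selection procedure is the paper's and is omitted here.
```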
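For the third bullet, the two image diversity transforms reduce to a few tensor operations. This sketch assumes square inputs whose side is divisible by the block count; the scale set, grid size, and rotation policy are assumptions, not the paper's reported hyperparameters.

```python
import random
import torch
import torch.nn.functional as F

def multi_scale(image, scales=(0.85, 0.9, 0.95, 1.0)):
    """Resize a (B, C, H, W) image to several scales and back to (H, W)."""
    _, _, h, w = image.shape
    views = []
    for s in scales:
        small = F.interpolate(image, scale_factor=s, mode="bilinear",
                              align_corners=False)
        views.append(F.interpolate(small, size=(h, w), mode="bilinear",
                                   align_corners=False))
    return views

def block_shuffle_rotate(image, n_blocks=2):
    """Split into an n_blocks x n_blocks grid, shuffle the blocks, and rotate
    each by a random multiple of 90 degrees (one plausible reading of BSR).
    Assumes a square image with side divisible by n_blocks."""
    _, _, h, w = image.shape
    bh, bw = h // n_blocks, w // n_blocks
    blocks = [image[:, :, i * bh:(i + 1) * bh, j * bw:(j + 1) * bw]
              for i in range(n_blocks) for j in range(n_blocks)]
    random.shuffle(blocks)
    blocks = [torch.rot90(b, k=random.randint(0, 3), dims=(2, 3))
              for b in blocks]
    rows = [torch.cat(blocks[r * n_blocks:(r + 1) * n_blocks], dim=3)
            for r in range(n_blocks)]
    return torch.cat(rows, dim=2)
```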

🛡️ Threat Analysis

Input Manipulation Attack

The paper proposes adversarial perturbations (both text-level word substitutions and image-level multi-scale/BSR transformations) that manipulate VLP model inputs at inference time, causing multimodal semantic alignment failures. This matches the core definition of an input manipulation attack.


Details

Domains
vision · nlp · multimodal
Model Types
vlm · transformer
Threat Tags
black_box · inference_time · targeted · digital
Datasets
Flickr30K · MS-COCO
Applications
image-text retrieval · visual question answering · multimodal semantic alignment