
Towards Highly Transferable Vision-Language Attack via Semantic-Augmented Dynamic Contrastive Interaction

Yuanbo Li 1, Tianyang Xu 1, Cong Hu 1, Tao Zhou 1, Xiao-Jun Wu 1, Josef Kittler 2

Published on arXiv: 2603.04839

Input Manipulation Attack

OWASP ML Top 10 — ML01

Key Finding

SADCA consistently surpasses state-of-the-art transfer-based adversarial attacks on VLP models across multiple datasets and tasks including image-text retrieval, image captioning, and visual grounding.

SADCA (Semantic-Augmented Dynamic Contrastive Attack)

Novel technique introduced


With the rapid advancement and widespread application of vision-language pre-training (VLP) models, their vulnerability to adversarial attacks has become a critical concern. Adversarial examples can typically be designed to transfer, attacking not only different models but also diverse tasks. However, existing attacks on vision-language models mainly rely on static cross-modal interactions and focus solely on disrupting positive image-text pairs, resulting in limited cross-modal disruption and poor transferability. To address this issue, we propose a Semantic-Augmented Dynamic Contrastive Attack (SADCA) that enhances adversarial transferability through progressive, semantically guided perturbation. SADCA progressively disrupts cross-modal alignment through dynamic interactions between adversarial images and texts, establishing a contrastive learning mechanism over adversarial, positive, and negative samples to reinforce the semantic inconsistency induced by the perturbations. Moreover, we empirically find that input transformations commonly used in traditional transfer-based attacks also benefit attacks on VLP models, which motivates a semantic augmentation module that increases the diversity and generalization of adversarial examples. Extensive experiments on multiple datasets and models demonstrate that SADCA significantly improves adversarial transferability and consistently surpasses state-of-the-art methods. The code is released at https://github.com/LiYuanBoJNU/SADCA.


Key Contributions

  • Dynamic contrastive interaction mechanism that iteratively disrupts cross-modal semantic alignment by alternately updating adversarial images and texts using both positive and negative sample pairs
  • Semantic augmentation module applying local semantic enhancement on images and mixed semantic augmentation on texts to diversify adversarial perturbations and improve generalization
  • Empirical demonstration that input transformation strategies from traditional transfer attacks also benefit VLP attacks, motivating a unified framework that consistently outperforms prior state-of-the-art methods
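The input-transformation idea behind the semantic augmentation module can be illustrated with a minimal sketch. The function below is a hypothetical example in the spirit of classic transfer-attack transformations (random rescaling plus padding), not the paper's actual module; the function name, sizes, and nearest-neighbor resizing are illustrative assumptions.

```python
import numpy as np

def random_resize_pad(img, rng=None):
    """Illustrative input-transformation step (not the paper's code):
    randomly downscale a square CHW image with nearest-neighbor sampling,
    then zero-pad it back to the original size at a random offset, so that
    attack gradients are computed over diverse views of the input."""
    if rng is None:
        rng = np.random.default_rng()
    c, h, w = img.shape                       # assumes a square image (h == w)
    new_h = int(rng.integers(h // 2, h + 1))  # random scale in [h/2, h]
    # nearest-neighbor downscale via index selection
    rows = np.arange(new_h) * h // new_h
    cols = np.arange(new_h) * w // new_h
    small = img[:, rows[:, None], cols[None, :]]
    # zero-pad back to (c, h, w) at a random top-left position
    top = int(rng.integers(0, h - new_h + 1))
    left = int(rng.integers(0, w - new_h + 1))
    out = np.zeros_like(img)
    out[:, top:top + new_h, left:left + new_h] = small
    return out
```

Applying such a transform before each gradient step averages the perturbation over many views, which is the mechanism traditional transfer attacks use to improve generalization.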

🛡️ Threat Analysis

Input Manipulation Attack

Proposes a novel transfer-based adversarial example attack targeting VLP models at inference time. Crafts adversarial image perturbations using gradient-based optimization on a white-box surrogate model, then transfers them to unseen black-box VLP targets. The core contribution is improving adversarial transferability through dynamic contrastive interactions and semantic augmentation — a canonical input manipulation attack.
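The attack pattern described above can be sketched as a PGD-style loop with a contrastive objective: reduce the adversarial image's similarity to its matching (positive) caption embedding while raising similarity to mismatched (negative) captions. This is a simplified assumption-laden sketch, not SADCA itself: it uses a toy linear surrogate encoder `W` so the gradient is analytic, and all names and hyperparameters are illustrative.

```python
import numpy as np

def contrastive_pgd(x, W, t_pos, t_negs, eps=8/255, alpha=2/255, steps=10):
    """Hypothetical sketch of a transfer attack with a contrastive loss.
    x      : flattened image in [0, 1]
    W      : linear surrogate image encoder (embedding_dim x pixel_dim)
    t_pos  : embedding of the matching caption
    t_negs : embeddings of mismatched captions, shape (k, embedding_dim)
    Loss to minimize: sim(Wx, t_pos) - mean_k sim(Wx, t_neg_k)."""
    adv = x.copy()
    # For a linear encoder the loss gradient w.r.t. the image is constant:
    grad = W.T @ (t_pos - t_negs.mean(axis=0))
    for _ in range(steps):
        adv = adv - alpha * np.sign(grad)         # signed gradient descent step
        adv = x + np.clip(adv - x, -eps, eps)     # project into the L-inf ball
        adv = np.clip(adv, 0.0, 1.0)              # keep a valid pixel range
    return adv
```

In the full method the encoder is a deep white-box surrogate (so the gradient is recomputed each step via backpropagation) and the text side is updated dynamically as well; the resulting perturbation is then transferred to unseen black-box VLP targets.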


Details

Domains
vision, nlp, multimodal
Model Types
vlm, transformer
Threat Tags
white_box, black_box, inference_time, untargeted, digital
Datasets
MS-COCO, Flickr30K
Applications
image-text retrieval, image captioning, visual grounding, vision-language pre-training models