Towards Highly Transferable Vision-Language Attack via Semantic-Augmented Dynamic Contrastive Interaction
Yuanbo Li 1, Tianyang Xu 1, Cong Hu 1, Tao Zhou 1, Xiao-Jun Wu 1, Josef Kittler 2
Published on arXiv
2603.04839
Input Manipulation Attack
OWASP ML Top 10 — ML01
Key Finding
SADCA consistently surpasses state-of-the-art transfer-based adversarial attacks on VLP models across multiple datasets and tasks including image-text retrieval, image captioning, and visual grounding.
SADCA (Semantic-Augmented Dynamic Contrastive Attack)
Novel technique introduced
With the rapid advancement and widespread application of vision-language pre-training (VLP) models, their vulnerability to adversarial attacks has become a critical concern. In general, adversarial examples can be crafted to transfer, attacking not only different models but also diverse tasks. However, existing attacks on vision-language models mainly rely on static cross-modal interactions and focus solely on disrupting positive image-text pairs, resulting in limited cross-modal disruption and poor transferability. To address this issue, we propose a Semantic-Augmented Dynamic Contrastive Attack (SADCA) that enhances adversarial transferability through progressive, semantically guided perturbation. SADCA progressively disrupts cross-modal alignment through dynamic interactions between adversarial images and texts. To achieve this, SADCA establishes a contrastive learning mechanism over adversarial, positive, and negative samples, reinforcing the semantic inconsistency induced by the perturbations. Moreover, we empirically find that input transformations commonly used in traditional transfer-based attacks also benefit attacks on VLP models, which motivates a semantic augmentation module that increases the diversity and generalization of adversarial examples. Extensive experiments on multiple datasets and models demonstrate that SADCA significantly improves adversarial transferability and consistently surpasses state-of-the-art methods. The code is released at https://github.com/LiYuanBoJNU/SADCA.
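The abstract's observation that input transformations help VLP attacks can be illustrated with a minimal sketch. The function below is a hypothetical "local semantic enhancement" step (the name and the nearest-neighbour resize are illustrative assumptions, not the paper's implementation): each attack iteration sees a random crop resized back to full resolution, so the perturbation cannot overfit to one fixed view of the image.

```python
import numpy as np

def local_crop_augment(img, rng, min_frac=0.6):
    """Illustrative local semantic enhancement (assumed form, not the
    paper's code): take a random crop covering at least min_frac of each
    dimension, then resize it back to the original resolution so each
    attack step optimizes against a slightly different view."""
    h, w = img.shape[:2]
    ch = rng.integers(int(min_frac * h), h + 1)   # crop height
    cw = rng.integers(int(min_frac * w), w + 1)   # crop width
    top = rng.integers(0, h - ch + 1)
    left = rng.integers(0, w - cw + 1)
    crop = img[top:top + ch, left:left + cw]
    # Nearest-neighbour resize back to (h, w).
    rows = np.arange(h) * ch // h
    cols = np.arange(w) * cw // w
    return crop[np.ix_(rows, cols)]
```

In a transfer attack, a fresh augmented view would be drawn at every gradient step, a design borrowed from input-diversity attacks on image classifiers.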
Key Contributions
- Dynamic contrastive interaction mechanism that iteratively disrupts cross-modal semantic alignment by alternately updating adversarial images and texts using both positive and negative sample pairs
- Semantic augmentation module applying local semantic enhancement on images and mixed semantic augmentation on texts to diversify adversarial perturbations and improve generalization
- Empirical demonstration that input transformation strategies from traditional transfer attacks also benefit VLP attacks, motivating a unified framework that consistently outperforms prior state-of-the-art methods
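The contrastive mechanism in the first contribution can be sketched with an InfoNCE-style objective. Everything below is an illustrative assumption about the loss shape, not SADCA's actual formulation: the attacker maximizes the loss, which pushes the adversarial image embedding away from its paired (positive) text and toward unrelated (negative) texts, reinforcing cross-modal semantic inconsistency.

```python
import numpy as np

def contrastive_attack_loss(adv_img_emb, pos_txt_emb, neg_txt_embs, tau=0.07):
    """Hypothetical InfoNCE-style loss (assumed form). An attacker
    *maximizing* this value drives the adversarial image embedding away
    from the positive text and toward the negative texts."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    pos = np.exp(cos(adv_img_emb, pos_txt_emb) / tau)
    negs = sum(np.exp(cos(adv_img_emb, n) / tau) for n in neg_txt_embs)
    # Standard InfoNCE: -log(pos / (pos + negs)).
    return -np.log(pos / (pos + negs))
```

An image embedding still aligned with its caption yields a small loss; one dragged toward a negative caption yields a large loss, which is the direction the attack's gradient steps would follow.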
🛡️ Threat Analysis
Proposes a novel transfer-based adversarial example attack targeting VLP models at inference time. Crafts adversarial image perturbations using gradient-based optimization on a white-box surrogate model, then transfers them to unseen black-box VLP targets. The core contribution is improving adversarial transferability through dynamic contrastive interactions and semantic augmentation — a canonical input manipulation attack.
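The white-box-to-black-box pipeline described above can be sketched in miniature. This is a generic projected-gradient-descent (PGD) attack on a toy logistic surrogate, not SADCA itself; the model, loss, and step sizes are all illustrative assumptions. The perturbation is optimized with signed gradient ascent on the surrogate's loss, kept inside an L-infinity ball of radius `eps`, and the resulting adversarial input would then be handed unchanged to unseen black-box targets.

```python
import numpy as np

def pgd_attack(x, w, b, y, eps=0.1, alpha=0.02, steps=10):
    """Minimal PGD sketch on a toy logistic surrogate (illustrative only).
    Maximizes binary cross-entropy loss w.r.t. the input while projecting
    the perturbation back into the eps L-infinity ball each step."""
    x_adv = x.copy()
    for _ in range(steps):
        z = w @ x_adv + b                          # surrogate logit
        p = 1.0 / (1.0 + np.exp(-z))               # sigmoid probability
        grad = (p - y) * w                         # d(BCE)/dx for label y
        x_adv = x_adv + alpha * np.sign(grad)      # signed gradient ascent
        x_adv = np.clip(x_adv, x - eps, x + eps)   # project into eps-ball
    return x_adv
```

In the transfer setting the attacker never queries the target: the hope, which SADCA's contrastive and augmentation components are designed to strengthen, is that a perturbation fooling the surrogate also fools black-box VLP models.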