Towards Highly Transferable Vision-Language Attack via Semantic-Augmented Dynamic Contrastive Interaction
Yuanbo Li 1, Tianyang Xu 1, Cong Hu 1, Tao Zhou 1, Xiao-Jun Wu 1, Josef Kittler 2
Published on arXiv
2603.04839
Input Manipulation Attack
OWASP ML Top 10 — ML01
Key Finding
SADCA consistently surpasses state-of-the-art transfer-based adversarial attacks on VLP models across multiple datasets and tasks including image-text retrieval, image captioning, and visual grounding.
SADCA (Semantic-Augmented Dynamic Contrastive Attack)
Novel technique introduced
With the rapid advancement and widespread application of vision-language pre-training (VLP) models, their vulnerability to adversarial attacks has become a critical concern. In general, adversarial examples can be crafted to transfer, attacking not only different models but also diverse tasks. However, existing attacks on vision-language models mainly rely on static cross-modal interactions and focus solely on disrupting positive image-text pairs, resulting in limited cross-modal disruption and poor transferability. To address this issue, we propose a Semantic-Augmented Dynamic Contrastive Attack (SADCA) that enhances adversarial transferability through progressive, semantically guided perturbation. SADCA progressively disrupts cross-modal alignment through dynamic interactions between adversarial images and texts. To achieve this, SADCA establishes a contrastive learning mechanism over adversarial, positive, and negative samples, reinforcing the semantic inconsistency induced by the perturbations. Moreover, we empirically find that input transformations commonly used in traditional transfer-based attacks also benefit attacks on VLP models, which motivates a semantic augmentation module that increases the diversity and generalization of adversarial examples. Extensive experiments on multiple datasets and models demonstrate that SADCA significantly improves adversarial transferability and consistently surpasses state-of-the-art methods. The code is released at https://github.com/LiYuanBoJNU/SADCA.
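The abstract's observation that input transformations help VLP attacks can be illustrated with a minimal sketch. The function below is a hypothetical "local semantic enhancement" step (the name and the nearest-neighbour resize are illustrative assumptions, not the paper's implementation): each attack iteration sees a random crop resized back to full resolution, so the perturbation cannot overfit to one fixed view of the image.

```python
import numpy as np

def local_crop_augment(img, rng, min_frac=0.6):
    """Illustrative local semantic enhancement (assumed form, not the
    paper's code): take a random crop covering at least min_frac of each
    dimension, then resize it back to the original resolution so each
    attack step optimizes against a slightly different view."""
    h, w = img.shape[:2]
    ch = rng.integers(int(min_frac * h), h + 1)   # crop height
    cw = rng.integers(int(min_frac * w), w + 1)   # crop width
    top = rng.integers(0, h - ch + 1)
    left = rng.integers(0, w - cw + 1)
    crop = img[top:top + ch, left:left + cw]
    # Nearest-neighbour resize back to (h, w).
    rows = np.arange(h) * ch // h
    cols = np.arange(w) * cw // w
    return crop[np.ix_(rows, cols)]
```

In a transfer attack, a fresh augmented view would be drawn at every gradient step, a design borrowed from input-diversity attacks on image classifiers.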
Key Contributions
- Dynamic contrastive interaction mechanism that iteratively disrupts cross-modal semantic alignment by alternately updating adversarial images and texts using both positive and negative sample pairs
- Semantic augmentation module applying local semantic enhancement on images and mixed semantic augmentation on texts to diversify adversarial perturbations and improve generalization
- Empirical demonstration that input transformation strategies from traditional transfer attacks also benefit VLP attacks, motivating a unified framework that consistently outperforms prior state-of-the-art methods
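The contrastive mechanism in the first contribution can be sketched with an InfoNCE-style objective. Everything below is an illustrative assumption about the loss shape, not SADCA's actual formulation: the attacker maximizes the loss, which pushes the adversarial image embedding away from its paired (positive) text and toward unrelated (negative) texts, reinforcing cross-modal semantic inconsistency.

```python
import numpy as np

def contrastive_attack_loss(adv_img_emb, pos_txt_emb, neg_txt_embs, tau=0.07):
    """Hypothetical InfoNCE-style loss (assumed form). An attacker
    *maximizing* this value drives the adversarial image embedding away
    from the positive text and toward the negative texts."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    pos = np.exp(cos(adv_img_emb, pos_txt_emb) / tau)
    negs = sum(np.exp(cos(adv_img_emb, n) / tau) for n in neg_txt_embs)
    # Standard InfoNCE: -log(pos / (pos + negs)).
    return -np.log(pos / (pos + negs))
```

An image embedding still aligned with its caption yields a small loss; one dragged toward a negative caption yields a large loss, which is the direction the attack's gradient steps would follow.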
🛡️ Threat Analysis
Proposes a novel transfer-based adversarial example attack targeting VLP models at inference time. Crafts adversarial image perturbations using gradient-based optimization on a white-box surrogate model, then transfers them to unseen black-box VLP targets. The core contribution is improving adversarial transferability through dynamic contrastive interactions and semantic augmentation — a canonical input manipulation attack.
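The white-box-to-black-box pipeline described above can be sketched in miniature. This is a generic projected-gradient-descent (PGD) attack on a toy logistic surrogate, not SADCA itself; the model, loss, and step sizes are all illustrative assumptions. The perturbation is optimized with signed gradient ascent on the surrogate's loss, kept inside an L-infinity ball of radius `eps`, and the resulting adversarial input would then be handed unchanged to unseen black-box targets.

```python
import numpy as np

def pgd_attack(x, w, b, y, eps=0.1, alpha=0.02, steps=10):
    """Minimal PGD sketch on a toy logistic surrogate (illustrative only).
    Maximizes binary cross-entropy loss w.r.t. the input while projecting
    the perturbation back into the eps L-infinity ball each step."""
    x_adv = x.copy()
    for _ in range(steps):
        z = w @ x_adv + b                          # surrogate logit
        p = 1.0 / (1.0 + np.exp(-z))               # sigmoid probability
        grad = (p - y) * w                         # d(BCE)/dx for label y
        x_adv = x_adv + alpha * np.sign(grad)      # signed gradient ascent
        x_adv = np.clip(x_adv, x - eps, x + eps)   # project into eps-ball
    return x_adv
```

In the transfer setting the attacker never queries the target: the hope, which SADCA's contrastive and augmentation components are designed to strengthen, is that a perturbation fooling the surrogate also fools black-box VLP models.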