Transferability of Adversarial Attacks in Video-based MLLMs: A Cross-modal Image-to-Video Approach
Linhao Huang 1,2,3, Xue Jiang 3,4, Zhiqiang Wang 5, Wentao Mo 1,3, Xi Xiao 1,2, Yongjie Yin 6, Bo Han 4, Feng Zheng 3
3 Southern University of Science and Technology
4 Hong Kong Baptist University
Published on arXiv
2501.01042
Input Manipulation Attack
OWASP ML Top 10 — ML01
Key Finding
Black-box I2V-MLLM attack using BLIP-2 as surrogate achieves 57.98% AASR on MSVD-QA and 58.26% on MSRVTT-QA, competitive with white-box attacks on target V-MLLMs.
I2V-MLLM
Novel technique introduced
Video-based multimodal large language models (V-MLLMs) have shown vulnerability to adversarial examples in video-text multimodal tasks. However, the transferability of adversarial videos to unseen models, a common and practical real-world scenario, remains unexplored. In this paper, we pioneer an investigation into the transferability of adversarial video samples across V-MLLMs. We find that existing adversarial attack methods face significant limitations when applied in black-box settings for V-MLLMs, which we attribute to the following shortcomings: (1) lacking generalization in perturbing video features, (2) focusing only on sparse key-frames, and (3) failing to integrate multimodal information. To address these limitations and deepen the understanding of V-MLLM vulnerabilities in black-box scenarios, we introduce the Image-to-Video MLLM (I2V-MLLM) attack. In I2V-MLLM, we utilize an image-based multimodal large language model (I-MLLM) as a surrogate model to craft adversarial video samples. Multimodal interactions and spatiotemporal information are integrated to disrupt video representations within the latent space, improving adversarial transferability. Additionally, a perturbation propagation technique is introduced to handle different, unknown frame sampling strategies. Experimental results demonstrate that our method generates adversarial examples with strong transferability across different V-MLLMs on multiple video-text multimodal tasks. Compared to white-box attacks on these models, our black-box attack (using BLIP-2 as a surrogate model) achieves competitive performance, with average attack success rates (AASR) of 57.98% on MSVD-QA and 58.26% on MSRVTT-QA for Zero-Shot VideoQA tasks.
Key Contributions
- First systematic investigation of adversarial transferability across video-based MLLMs in black-box settings
- I2V-MLLM attack using an image-based MLLM (BLIP-2) as surrogate with spatiotemporal pooling and multimodal (vision-text) cosine similarity losses to improve cross-model transferability
- Perturbation propagation technique that extends key-frame perturbations to full video clips, handling unknown frame sampling strategies of unseen target models
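The perturbation propagation idea in the last contribution can be sketched as follows. This is a hypothetical minimal illustration, not the paper's implementation: `propagate_perturbations`, its segment layout, and the L-inf budget `eps` are all assumptions. The sketch reuses each key-frame's perturbation for every frame up to the next key-frame, so a target model with an unknown frame sampling strategy still reads perturbed frames.

```python
import numpy as np

def propagate_perturbations(video, keyframe_idx, keyframe_deltas, eps=8 / 255):
    """Hypothetical sketch: copy each key-frame's perturbation onto all
    frames between it and the next key-frame, so any unknown frame
    sampling strategy of a target V-MLLM still samples perturbed frames.

    video: float array of frames in [0, 1], shape (T, H, W) or (T, H, W, C)
    keyframe_idx: sorted key-frame indices, e.g. [0, 8, 16]
    keyframe_deltas: one perturbation per key-frame, same shape as a frame
    """
    adv = video.astype(np.float32).copy()
    # Segment boundaries: each key-frame owns the frames up to the next one.
    bounds = list(keyframe_idx) + [len(video)]
    segments = zip(bounds[:-1], bounds[1:])
    for (start, end), delta in zip(segments, keyframe_deltas):
        # Reuse the key-frame's perturbation (clipped to the budget)
        # for the whole segment.
        adv[start:end] += np.clip(delta, -eps, eps)
    # Keep adversarial frames valid images.
    return np.clip(adv, 0.0, 1.0)
```

One could instead interpolate between neighboring key-frame perturbations; the segment-copy variant above is just the simplest scheme consistent with the description.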
🛡️ Threat Analysis
Proposes I2V-MLLM, a gradient-based (PGD) adversarial attack that perturbs video frame pixels at inference time to induce incorrect outputs from V-MLLMs in black-box transfer settings: a classic input manipulation attack targeting the visual inputs of multimodal models.
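The core PGD feature-disruption loop can be sketched in a few lines. This is a toy illustration under stated assumptions, not the paper's method: a linear map `encoder_w` stands in for the surrogate I-MLLM's vision encoder (which would be BLIP-2 in the paper), and the loss is the cosine similarity between adversarial and clean features, minimized under an L-inf budget.

```python
import numpy as np

def pgd_cosine_attack(x, encoder_w, steps=10, alpha=2 / 255, eps=8 / 255):
    """Toy PGD sketch: drive the surrogate encoder's feature of x_adv
    away from the clean feature by minimizing cosine similarity.
    encoder_w is a hypothetical linear stand-in for an I-MLLM encoder."""
    clean_feat = encoder_w @ x
    x_adv = x.copy()
    for _ in range(steps):
        u = encoder_w @ x_adv
        v = clean_feat
        nu, nv = np.linalg.norm(u), np.linalg.norm(v)
        cos = u @ v / (nu * nv)
        # Analytic gradient of cos(u, v) w.r.t. u, chained through
        # the linear encoder (autograd would do this for a real model).
        dcos_du = v / (nu * nv) - cos * u / nu**2
        grad = encoder_w.T @ dcos_du
        x_adv = x_adv - alpha * np.sign(grad)      # descend: lower similarity
        x_adv = np.clip(x_adv, x - eps, x + eps)   # project into the eps-ball
        x_adv = np.clip(x_adv, 0.0, 1.0)           # stay a valid image
    return x_adv
```

The paper's attack additionally aggregates spatiotemporal features across frames and adds a vision-text similarity term; this sketch shows only the shared PGD skeleton with a single cosine objective.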