When Alignment Fails: Multimodal Adversarial Attacks on Vision-Language-Action Models
Yuping Yan 1, Yuhan Xie 1, Yixin Zhang 2, Lingjuan Lyu 3, Handing Wang 4, Yaochu Jin 1
Published on arXiv: 2511.16203
Input Manipulation Attack
OWASP ML Top 10 — ML01
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
Even minor multimodal perturbations cause significant behavioral deviations in a fine-tuned OpenVLA model, exposing critical fragility in cross-modal alignment under both white-box and black-box conditions.
VLA-Fool
Novel technique introduced
Vision-Language-Action models (VLAs) have recently demonstrated remarkable progress in embodied environments, enabling robots to perceive, reason, and act through unified multimodal understanding. Despite their impressive capabilities, the adversarial robustness of these systems remains largely unexplored, especially under realistic multimodal and black-box conditions. Existing studies mainly focus on single-modality perturbations and overlook the cross-modal misalignment that fundamentally affects embodied reasoning and decision-making. In this paper, we introduce VLA-Fool, a comprehensive study of multimodal adversarial robustness in embodied VLA models under both white-box and black-box settings. VLA-Fool unifies three levels of multimodal adversarial attacks: (1) textual perturbations through gradient-based and prompt-based manipulations, (2) visual perturbations via patch and noise distortions, and (3) cross-modal misalignment attacks that intentionally disrupt the semantic correspondence between perception and instruction. We further incorporate a VLA-aware semantic space into linguistic prompts, developing the first automatically crafted and semantically guided prompting framework. Experiments on the LIBERO benchmark using a fine-tuned OpenVLA model reveal that even minor multimodal perturbations can cause significant behavioral deviations, demonstrating the fragility of embodied multimodal alignment.
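To make the visual-noise level of attack concrete, here is a minimal FGSM-style sketch: a bounded, gradient-sign perturbation that pushes an input away from the action the model would otherwise select. The toy linear "policy head," the epsilon value, and all names here are illustrative assumptions for exposition, not the paper's actual models or attack implementation (which targets a fine-tuned OpenVLA).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear "policy head": maps a flattened 16-pixel image to 4 action logits.
# Stands in for the (far larger) VLA action decoder purely for illustration.
W = rng.normal(size=(4, 16))

def logits(x):
    return W @ x

def fgsm_noise(x, true_action, eps=0.03):
    """One FGSM step: perturb x to reduce the selected action's logit.

    For this linear model, the gradient of the chosen logit w.r.t. the
    input is just the corresponding row of W, so we take its sign
    analytically instead of using autograd.
    """
    grad = W[true_action]                          # d(logit_true)/dx
    # Step against the gradient, clipped back to the valid pixel range.
    return np.clip(x - eps * np.sign(grad), 0.0, 1.0)

x = rng.uniform(0.0, 1.0, size=16)   # clean "image"
a = int(np.argmax(logits(x)))        # action the clean input selects
x_adv = fgsm_noise(x, a, eps=0.1)

print(np.max(np.abs(x_adv - x)) <= 0.1 + 1e-9)   # True: perturbation is bounded
print(logits(x_adv)[a] <= logits(x)[a])          # True: chosen-action logit drops
```

The key property the paper stresses carries over even to this toy: the perturbation is small (bounded by eps per pixel) yet systematically degrades the model's preference for the originally chosen action.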
Key Contributions
- VLA-Fool: a unified multimodal adversarial attack framework covering textual (gradient-based and prompt-based), visual (patch and noise), and cross-modal misalignment attacks against VLA models
- First cross-modal misalignment attack that intentionally disrupts semantic correspondence between visual perception and language instruction in embodied multimodal systems
- Automatically crafted, VLA-aware semantically guided prompting framework — the first of its kind for embodied VLA adversarial evaluation
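The cross-modal misalignment idea in the contributions above can be sketched as a selection problem: given a fixed visual scene, pick the grammatical instruction whose embedding is least aligned with the scene's, maximizing perception-instruction mismatch. The embeddings, instructions, and cosine criterion below are stand-in assumptions; a real attack would query the victim model's own vision and text encoders.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy embeddings standing in for the VLA's vision/text encoder outputs.
scene_emb = np.array([1.0, 0.2, 0.0])            # visual embedding of the scene
instructions = {
    "pick up the red block":  np.array([0.9, 0.3, 0.1]),
    "open the drawer":        np.array([-0.2, 1.0, 0.4]),
    "stack the blue cup":     np.array([0.1, -0.8, 0.9]),
}

# Choose the instruction least aligned with the scene, maximizing
# cross-modal mismatch while keeping the text itself well-formed.
worst = min(instructions, key=lambda k: cosine(scene_emb, instructions[k]))
print(worst)   # stack the blue cup
```

This captures why the attack is distinct from single-modality perturbations: neither the image nor the instruction is corrupted in isolation; only their semantic correspondence is broken.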
🛡️ Threat Analysis
Proposes adversarial visual perturbations (patches and noise distortions) against VLA models at inference time, together with gradient-based and prompt-based textual perturbations and cross-modal misalignment attacks: classic input manipulation methodology extended to multimodal embodied models.
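On the textual side, the simplest prompt-based manipulation is a meaning-preserving word substitution that probes whether the model's instruction following survives paraphrase. The synonym table and function below are hypothetical illustrations, not the paper's semantically guided prompting framework, which crafts perturbations automatically from a VLA-aware semantic space.

```python
# Hypothetical synonym table; a real attack would search a semantic space
# guided by the victim model rather than use a fixed hand-written lookup.
SYNONYMS = {"pick": "grab", "place": "put", "bowl": "dish"}

def perturb_instruction(instruction: str) -> str:
    """Swap words for near-synonyms to probe instruction-following robustness."""
    words = instruction.lower().split()
    return " ".join(SYNONYMS.get(w, w) for w in words)

print(perturb_instruction("Pick up the bowl"))   # grab up the dish
```

A robust VLA should execute the same behavior for both phrasings; the paper's finding is that even small perturbations of this kind can produce significant behavioral deviations.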