ADVEDM: Fine-grained Adversarial Attack against VLM-based Embodied Agents
Yichen Wang 1, Hangtao Zhang 1, Hewen Pan 1, Ziqi Zhou 1, Xianlong Wang 2, Peijin Guo 1, Lulu Xue 1, Shengshan Hu 1, Minghui Li 1, Leo Yu Zhang 3
Published on arXiv (2509.16645)
Input Manipulation Attack
OWASP ML Top 10 — ML01
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
Fine-grained object-targeted adversarial perturbations achieve superior attack effectiveness in embodied decision-making tasks by preserving semantic coherence with system prompts, causing VLMs to output valid but incorrect actions.
ADVEDM
Novel technique introduced
Vision-Language Models (VLMs), with their strong reasoning and planning capabilities, are widely used for embodied decision-making (EDM) in embodied agents, such as autonomous driving and robotic manipulation. Recent research has increasingly explored adversarial attacks on VLMs to reveal their vulnerabilities. However, these attacks either rely on overly strong assumptions, requiring full knowledge of the victim VLM, which is impractical for attacking VLM-based agents, or exhibit limited effectiveness. The latter stems from disrupting most semantic information in the image, which leads to a misalignment between the perception and the task context defined by system prompts. This inconsistency interrupts the VLM's reasoning process, resulting in invalid outputs that fail to affect interactions in the physical world. To this end, we propose a fine-grained adversarial attack framework, ADVEDM, which modifies the VLM's perception of only a few key objects while preserving the semantics of the remaining regions. This attack effectively reduces conflicts with the task context, making VLMs output valid but incorrect decisions and affecting the actions of agents, thus posing a more substantial safety threat in the physical world. We design two variants based on this framework, ADVEDM-R and ADVEDM-A, which respectively remove the semantics of a specific object from the image and add the semantics of a new object into the image. Experimental results in both general scenarios and EDM tasks demonstrate fine-grained control and excellent attack performance.
Key Contributions
- Fine-grained adversarial attack framework (ADVEDM) that perturbs only key object regions while preserving surrounding semantics, reducing misalignment with task context in system prompts.
- Two attack variants: ADVEDM-R (erases a target object's semantics) and ADVEDM-A (injects the semantics of a new phantom object), enabling controlled manipulation of VLM perception.
- Demonstrates that fine-grained, object-targeted attacks produce valid but incorrect VLM decisions, making them more practically dangerous in embodied agent settings (autonomous driving, robotic manipulation) than coarse-grained global attacks.
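The core mechanism described above, perturbing only the pixels of a chosen object region while leaving the rest of the image intact, can be illustrated with a masked PGD loop. The sketch below is not the paper's actual implementation; it is a minimal, hedged illustration in which a toy network stands in for the victim VLM's vision encoder, and the remove/add modes loosely mirror ADVEDM-R and ADVEDM-A by pushing the image embedding away from, or toward, a target object's embedding.

```python
import torch
import torch.nn as nn


def masked_pgd(encoder, image, mask, target_emb,
               steps=20, eps=8 / 255, alpha=2 / 255, remove=True):
    """Illustrative masked PGD attack (not the paper's exact algorithm).

    Only pixels where mask == 1 (the targeted object region) are perturbed,
    so semantics of the surrounding regions are preserved by construction.
    remove=True  : minimize similarity to target_emb (ADVEDM-R-style erasure)
    remove=False : maximize similarity to target_emb (ADVEDM-A-style injection)
    """
    delta = torch.zeros_like(image, requires_grad=True)
    for _ in range(steps):
        # Perturbation is gated by the object mask before entering the encoder.
        adv = (image + delta * mask).clamp(0, 1)
        emb = encoder(adv)
        sim = torch.cosine_similarity(emb, target_emb, dim=-1).mean()
        loss = sim if remove else -sim  # descend on sim to erase, on -sim to inject
        loss.backward()
        with torch.no_grad():
            delta -= alpha * delta.grad.sign()  # signed gradient descent step
            delta.clamp_(-eps, eps)             # keep perturbation imperceptible
            delta.grad.zero_()
    return (image + delta.detach() * mask).clamp(0, 1)
```

Because the perturbation is multiplied by the mask before encoding, pixels outside the object region are provably untouched, which is the property the paper credits with keeping the VLM's output consistent with the system-prompt task context.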
🛡️ Threat Analysis
The paper's primary contribution is adversarial visual perturbations crafted at inference time that cause VLMs to misperceive specific objects, resulting in misclassification and incorrect decisions: a canonical input-manipulation attack on visual inputs.