attack 2026

If you're waiting for a sign... that might not be it! Mitigating Trust Boundary Confusion from Visual Injections on Vision-Language Agentic Systems

Jiamin Chang 1, Minhui Xue, Ruoxi Sun 2, Shuchao Pang 3, Salil S. Kanhere 1, Hammond Pearce 1

Published on arXiv

2604.19844

Input Manipulation Attack

OWASP ML Top 10 — ML01

Prompt Injection

OWASP LLM Top 10 — LLM01

Excessive Agency

OWASP LLM Top 10 — LLM08

Key Finding

Multi-agent defense significantly reduces misleading behaviors while preserving correct responses to legitimate environmental signals


Recent advances in embodied Vision-Language Agentic Systems (VLAS), powered by large vision-language models (LVLMs), enable AI systems to perceive and reason over real-world scenes. Within this context, environmental signals such as traffic lights are essential in-band signals that can and should influence agent behavior. However, similar signals could also be crafted to operate as misleading visual injections, overriding user intent and posing security risks. This duality creates a fundamental challenge: agents must respond to legitimate environmental cues while remaining robust to misleading ones. We refer to this tension as trust boundary confusion. To study this behavior, we design a dual-intent dataset and evaluation framework, through which we show that current LVLM-based agents fail to reliably balance this trade-off, either ignoring useful signals or following harmful ones. We systematically evaluate 7 LVLM agents across multiple embodied settings under both structure-based and noise-based visual injections. To address these vulnerabilities, we propose a multi-agent defense framework that separates perception from decision-making to dynamically assess the reliability of visual inputs. Our approach significantly reduces misleading behaviors while preserving correct responses and provides robustness guarantees under adversarial perturbations. The code of the evaluation framework and artifacts are made available at https://anonymous.4open.science/r/Visual-Prompt-Inject.
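
The dual-intent trade-off described above can be made concrete with two rates: how often an agent follows legitimate signals, and how often it follows misleading ones. The following is a minimal sketch under stated assumptions, not the paper's evaluation code. The `Trial` record, the `dual_intent_rates` function, and the metric names are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class Trial:
    """One evaluation case (hypothetical schema, not from the paper)."""
    signal_is_legitimate: bool   # ground truth: should the agent obey the signal?
    agent_followed_signal: bool  # observed: did the agent obey it?


def dual_intent_rates(trials: Iterable[Trial]) -> Dict[str, float]:
    """Return the follow rate on legitimate signals and on misleading ones."""
    trials = list(trials)
    legit = [t for t in trials if t.signal_is_legitimate]
    misleading = [t for t in trials if not t.signal_is_legitimate]

    def follow_rate(group):
        # NaN marks an empty group so a missing split is not read as 0%.
        if not group:
            return float("nan")
        return sum(t.agent_followed_signal for t in group) / len(group)

    return {
        "legit_follow_rate": follow_rate(legit),  # higher is better
        "misled_rate": follow_rate(misleading),   # lower is better
    }
```

Under this framing, an agent that ignores every visual signal scores 0 on both rates and an agent that obeys every signal scores 1 on both. Neither balances the trade-off, which is why the two rates are reported together rather than collapsed into one number.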


Key Contributions

  • Identifies and formalizes 'trust boundary confusion' in embodied VLM agents where environmental visual signals can override user instructions
  • Dual-intent dataset and evaluation framework testing 7 LVLM agents across structure-based and noise-based visual injections
  • Multi-agent defense framework separating perception from decision-making with robustness guarantees under adversarial perturbations
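
The separation of perception from decision-making named in the last contribution could look roughly like the sketch below. This is an illustrative assumption, not the authors' architecture: the summary does not say how reliability is assessed, so the `Observation` type, the scalar `reliability` score, the `threshold` parameter, and the `decide` function are all hypothetical, and the perception agent is a plain callable standing in for an LVLM call.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class Observation:
    """Structured output of a hypothetical perception agent."""
    signal_text: str    # what the visual signal appears to instruct
    reliability: float  # assumed score in [0, 1]; how it is computed is not specified


def decide(
    user_goal: str,
    image: Any,
    perceive: Callable[[Any], Observation],
    threshold: float = 0.5,  # assumed cutoff, purely illustrative
) -> str:
    """Decision step that sees only the structured observation, never raw pixels."""
    obs = perceive(image)
    if obs.reliability >= threshold:
        # Trusted in-band signal: let it shape how the user goal is carried out.
        return f"{user_goal}; obey signal: {obs.signal_text}"
    # Untrusted signal: keep the user goal and discard the injected instruction.
    return f"{user_goal}; ignore untrusted signal"
```

The point of the design, as far as the summary describes it, is that the component choosing an action never consumes the visual instruction directly. It receives the instruction together with an explicit judgment about whether that instruction should be trusted.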

🛡️ Threat Analysis

Prompt Injection

Paper explicitly addresses prompt injection in vision-language models where visual inputs override user textual instructions, causing agents to follow malicious environmental cues instead of legitimate commands.

Input Manipulation Attack

Paper proposes adversarial visual inputs (both structure-based and noise-based perturbations) that manipulate VLM agent behavior at inference time, causing misalignment with user intent.


Details

Domains
vision, multimodal
Model Types
vlm, multimodal, transformer
Threat Tags
black_box, white_box, inference_time, physical, digital
Applications
autonomous driving, drone emergency landing, robotic systems, embodied ai agents