Dual-Modality Multi-Stage Adversarial Safety Training: Robustifying Multimodal Web Agents Against Cross-Modal Attacks
Haoyu Liu 1,2, Dingcheng Li 3, Lukas Rutishauser 2, Zeyu Zheng 1
Published on arXiv: 2603.04364
Prompt Injection
OWASP LLM Top 10 — LLM01
Excessive Agency
OWASP LLM Top 10 — LLM08
Key Finding
DMAST doubles task completion efficiency on out-of-distribution tasks while substantially mitigating adversarial risks, significantly outperforming established training-based and prompt-based defenses.
DMAST (Dual-Modality Multi-Stage Adversarial Safety Training)
Novel technique introduced
Multimodal web agents that process both screenshots and accessibility trees are increasingly deployed to interact with web interfaces, yet their dual-stream architecture opens an underexplored attack surface: an adversary who injects content into the webpage DOM simultaneously corrupts both observation channels with a consistent deceptive narrative. Our vulnerability analysis on MiniWob++ reveals that attacks including a visual component far outperform text-only injections, exposing critical gaps in text-centric VLM safety training. Motivated by this finding, we propose Dual-Modality Multi-Stage Adversarial Safety Training (DMAST), a framework that formalizes the agent-attacker interaction as a two-player zero-sum Markov game and co-trains both players through a three-stage pipeline: (1) imitation learning from a strong teacher model, (2) oracle-guided supervised fine-tuning that uses a novel zero-acknowledgment strategy to instill task-focused reasoning under adversarial noise, and (3) adversarial reinforcement learning via Group Relative Policy Optimization (GRPO) self-play. On out-of-distribution tasks, DMAST substantially mitigates adversarial risks while simultaneously doubling task completion efficiency. Our approach significantly outperforms established training-based and prompt-based defenses, demonstrating genuine co-evolutionary progress and robust generalization to complex, unseen environments.
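The agent–attacker interaction the abstract formalizes as a two-player zero-sum Markov game can be written in the standard minimax form. This is a sketch based only on the abstract's description; the notation (policies \(\pi\), reward \(r\), discount \(\gamma\)) is generic and not taken from the paper:

```latex
\max_{\pi_{\text{agent}}} \; \min_{\pi_{\text{atk}}}
\;\mathbb{E}_{\tau \sim (\pi_{\text{agent}},\, \pi_{\text{atk}})}
\left[ \sum_{t=0}^{T} \gamma^{t}\, r(s_t, a_t) \right],
\qquad
r_{\text{atk}}(s_t, a_t) = -\,r(s_t, a_t)
```

The zero-sum coupling means any reward the agent earns for completing the task is a loss for the attacker injecting content into the DOM, which is what drives the co-evolutionary pressure during self-play.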
Key Contributions
- Vulnerability analysis showing cross-modal DOM injection attacks (visual + text) dramatically outperform text-only injections on multimodal web agents, exposing gaps in text-centric VLM safety training.
- DMAST: a three-stage adversarial safety training pipeline (imitation learning → oracle-guided SFT with zero-acknowledgment strategy → adversarial RL via GRPO self-play) modeled as a two-player zero-sum Markov game.
- Demonstrates that DMAST substantially mitigates adversarial risk while doubling task completion efficiency on out-of-distribution tasks, outperforming both training-based and prompt-based defenses.
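The third stage above pairs GRPO with zero-sum self-play. A minimal sketch of the group-relative advantage computation that characterizes GRPO, under the zero-sum reward coupling described in the paper; the function names and example rewards are illustrative, not from the paper's implementation:

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages: normalize each rollout's reward
    against the mean and std of its sampled group (the core GRPO idea,
    which avoids training a separate value/critic network)."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against zero variance
    return [(r - mean) / std for r in rewards]

# Hypothetical group of 4 rollouts: 1.0 = task success, 0.0 = failure.
agent_rewards = [1.0, 0.0, 1.0, 0.0]

# Zero-sum Markov game: the attacker's reward is the negation of the agent's.
attacker_rewards = [-r for r in agent_rewards]

agent_adv = grpo_advantages(agent_rewards)        # [1.0, -1.0, 1.0, -1.0]
attacker_adv = grpo_advantages(attacker_rewards)  # [-1.0, 1.0, -1.0, 1.0]
```

Each player's policy is then updated toward rollouts with positive advantage within its own group, so the attacker is rewarded exactly when the agent fails — the self-play dynamic the contributions list refers to.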