WAInjectBench: Benchmarking Prompt Injection Detections for Web Agents

Multiple prompt injection attacks have been proposed against web agents. At the same time, various methods have been developed to detect general prompt injection attacks, but none have been systematically evaluated for web agents. In this work, we bridge this gap by presenting the first comprehensive benchmark study on detecting prompt injection attacks targeting web agents. We begin by introducing a fine-grained categorization of such attacks based on the threat model. We then construct datasets containing both malicious and benign samples: malicious text segments generated by different attacks, benign text segments from four categories, malicious images produced by attacks, and benign images from two categories. Next, we systematize both text-based and image-based detection methods. Finally, we evaluate their performance across multiple scenarios. Our key findings show that while some detectors can identify attacks that rely on explicit textual instructions or visible image perturbations with moderate to high accuracy, they largely fail against attacks that omit explicit instructions or employ imperceptible perturbations. Our datasets and code are released at: https://github.com/Norrrrrrr-lyn/WAInjectBench.

Key Contributions

First comprehensive benchmark for evaluating prompt injection detection methods specifically targeting web agents, with a fine-grained threat-model-based attack categorization
Curated datasets of malicious/benign text and image samples spanning multiple attack strategies and benign categories for rigorous evaluation
Systematic evaluation revealing that current detectors fail against prompt injection attacks that omit explicit instructions or use imperceptible image perturbations

🛡️ Threat Analysis

Input Manipulation Attack

The benchmark explicitly covers image-based attacks using imperceptible adversarial perturbations targeting visual web agents (VLMs), and evaluates whether detectors can identify these visually adversarial inputs — a core ML01 threat.

Details

Domains

nlpvisionmultimodal

Model Types

llmvlm

Threat Tags

black_boxinference_timetargeteddigital

Datasets

WAInjectBench (custom constructed)

Applications

2026 0 cit.

Input Manipulation Attack

86%

WAInjectBench: Benchmarking Prompt Injection Detections for Web Agents

Key Contributions

🛡️ Threat Analysis

Details

Similar Papers

AgentTypo: Adaptive Typographic Prompt Injection Attacks against Black-box Multimodal Agents

Text Prompt Injection of Vision Language Models

Black-box Optimization of LLM Outputs by Asking for Directions

Poisoning Prompt-Guided Sampling in Video Large Language Models

Make Anything Match Your Target: Universal Adversarial Perturbations against Closed-Source MLLMs via Multi-Crop Routed Meta Optimization

Enhancing Targeted Adversarial Attacks on Large Vision-Language Models via Intermediate Projector

Odysseus: Jailbreaking Commercial Multimodal LLM-integrated Systems via Dual Steganography

Crafting Adversarial Inputs for Large Vision-Language Models Using Black-Box Optimization