tool 2026

MUZZLE: Adaptive Agentic Red-Teaming of Web Agents Against Indirect Prompt Injection Attacks

Georgios Syros ¹, Evan Rose ¹, Brian Grinstead ², Christoph Kerschbaumer ², William Robertson ¹, Cristina Nita-Rotaru ¹, Alina Oprea ¹

¹ Northeastern University

² Mozilla Corporation

0 citations · 79 references · arXiv (Cornell University)

Published on arXiv

2602.09222

Prompt Injection

OWASP LLM Top 10 — LLM01

Excessive Agency

OWASP LLM Top 10 — LLM08

Key Finding

MUZZLE automatically discovers 37 new indirect prompt injection attacks across 4 web applications with 10 adversarial objectives, including cross-application attacks and a novel agent-tailored phishing strategy.

MUZZLE

Novel technique introduced

Large language model (LLM) based web agents are increasingly deployed to automate complex online tasks by directly interacting with web sites and performing actions on users' behalf. While these agents offer powerful capabilities, their design exposes them to indirect prompt injection attacks embedded in untrusted web content, enabling adversaries to hijack agent behavior and violate user intent. Despite growing awareness of this threat, existing evaluations rely on fixed attack templates, manually selected injection surfaces, or narrowly scoped scenarios, limiting their ability to capture realistic, adaptive attacks encountered in practice. We present MUZZLE, an automated agentic framework for evaluating the security of web agents against indirect prompt injection attacks. MUZZLE utilizes the agent's trajectories to automatically identify high-salience injection surfaces, and adaptively generate context-aware malicious instructions that target violations of confidentiality, integrity, and availability. Unlike prior approaches, MUZZLE adapts its attack strategy based on the agent's observed execution trajectory and iteratively refines attacks using feedback from failed executions. We evaluate MUZZLE across diverse web applications, user tasks, and agent configurations, demonstrating its ability to automatically and adaptively assess the security of web agents with minimal human intervention. Our results show that MUZZLE effectively discovers 37 new attacks on 4 web applications with 10 adversarial objectives that violate confidentiality, availability, or privacy properties. MUZZLE also identifies novel attack strategies, including 2 cross-application prompt injection attacks and an agent-tailored phishing scenario.

Key Contributions

Automated trajectory-based identification of high-salience injection surfaces in web agent executions
Adaptive, iterative attack generation that refines malicious prompts using feedback from failed injection attempts
Empirical discovery of 37 novel attacks across 4 web apps, including 2 cross-application prompt injection attacks and an agent-tailored phishing scenario
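
The first two contributions describe a loop: pick injection surfaces from the agent's recorded trajectory, then refine an injected instruction using feedback from failed attempts. The following is a minimal sketch of that control flow only, not MUZZLE's implementation. Every name in it (`Step`, `rank_surfaces`, `red_team`, `run_agent`, `refine`) is hypothetical, read frequency is an assumed stand-in for salience, and in a real harness `run_agent` would drive a sandboxed web agent while `refine` would call an attacker-side LLM.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Surface = Tuple[str, str]  # (page, element): a hypothetical surface identifier


@dataclass
class Step:
    """One step of a recorded agent trajectory (hypothetical schema)."""
    page: str
    element: str  # UI element whose content the agent read on this step


def rank_surfaces(trajectory: List[Step]) -> List[Surface]:
    """Order surfaces by how often the agent read them, most frequent first.

    Assumption: read frequency approximates salience. The paper may use a
    different criterion.
    """
    counts = Counter((s.page, s.element) for s in trajectory)
    return [surface for surface, _ in counts.most_common()]


def red_team(
    trajectory: List[Step],
    run_agent: Callable[[Surface, str], Tuple[bool, str]],
    refine: Callable[[str, str], str],
    seed_payload: str,
    max_rounds: int = 3,
) -> Optional[Tuple[Surface, str]]:
    """Try surfaces in salience order, refining the payload after each failure.

    run_agent(surface, payload) returns (objective_met, feedback).
    refine(payload, feedback) returns a revised payload.
    Returns the first (surface, payload) pair that succeeds, or None.
    """
    for surface in rank_surfaces(trajectory):
        payload = seed_payload
        for _ in range(max_rounds):
            succeeded, feedback = run_agent(surface, payload)
            if succeeded:
                return surface, payload
            # Use the failed execution's feedback to revise the payload.
            payload = refine(payload, feedback)
    return None
```

Passing `run_agent` and `refine` as callables keeps the loop independent of any particular agent or model, so the same skeleton can be exercised with stubs in a test environment before it is connected to a sandboxed deployment.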

🛡️ Threat Analysis

Details

Domains

nlp

Model Types

llm

Threat Tags

black_box · inference_time · targeted

Applications

llm-based web agents · web automation · agentic ai systems


Similar Papers

LAAF: Logic-layer Automated Attack Framework A Systematic Red-Teaming Methodology for LPCI Vulnerabilities in Agentic Large Language Model Systems

"Your AI, My Shell": Demystifying Prompt Injection Attacks on Agentic AI Coding Editors

DREAM: Dynamic Red-teaming across Environments for AI Models

AgentSight: System-Level Observability for AI Agents Using eBPF

ClawTrap: A MITM-Based Red-Teaming Framework for Real-World OpenClaw Security Evaluation

AJAR: Adaptive Jailbreak Architecture for Red-teaming

Attack the Messages, Not the Agents: A Multi-round Adaptive Stealthy Tampering Framework for LLM-MAS

FinVault: Benchmarking Financial Agent Safety in Execution-Grounded Environments