Indirect Prompt Injections: Are Firewalls All You Need, or Stronger Benchmarks?
Rishika Bhagwatkar 1,2,3, Kevin Kasa 1,4, Abhay Puri 1, Gabriel Huang 1, Irina Rish 2,3, Graham W. Taylor 5,4, Krishnamurthy Dj Dvijotham 1, Alexandre Lacoste 1
Published on arXiv: 2510.05244
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
The firewall defense achieves perfect or near-perfect security (0% or the lowest possible attack success rate) while maintaining high utility across four benchmarks, but it remains bypassable by adaptive attacks absent from current benchmarks
Novel Technique
Tool-Input Firewall (Minimizer) + Tool-Output Firewall (Sanitizer)
AI agents are vulnerable to indirect prompt injection attacks, in which malicious instructions embedded in external content or tool outputs cause unintended or harmful behavior. Inspired by the well-established concept of firewalls, we show that a simple, modular, and model-agnostic defense operating at the agent-tool interface achieves perfect security (0% or the lowest possible attack success rate) with high utility (task success rate) across four public benchmarks: AgentDojo, Agent Security Bench, InjecAgent, and tau-Bench, achieving a state-of-the-art security-utility tradeoff compared to prior results. Specifically, we employ a defense based on two firewalls: a Tool-Input Firewall (Minimizer) and a Tool-Output Firewall (Sanitizer). Unlike prior, more complex approaches, this firewall defense makes minimal assumptions about the agent and can be deployed out of the box while maintaining strong performance without compromising utility. However, our analysis also reveals critical limitations in these existing benchmarks, including flawed success metrics, implementation bugs, and, most importantly, weak attacks, all of which hinder meaningful progress in the field. To foster such progress, we present targeted fixes to these issues for AgentDojo and Agent Security Bench and propose best practices for more robust benchmark design. Further, we demonstrate that although these firewalls push the state of the art on existing benchmarks, they can still be bypassed in practice, underscoring the need to incorporate stronger attacks into security benchmarks. Overall, our work shows that existing agentic security benchmarks are easily saturated by a simple approach, and it highlights the need for stronger benchmarks with carefully chosen evaluation metrics and strong adaptive attacks.
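To make the agent-tool interface placement concrete, here is a minimal sketch of the two-firewall pattern the abstract describes. All names are hypothetical, and the rule-based filter stands in for whatever the paper's actual Minimizer and Sanitizer use internally (e.g., an LLM-based rewriter); it is an illustration of the wrapping pattern, not the authors' implementation.

```python
import re

# Hypothetical sketch of the two-firewall pattern: every tool call is
# wrapped by an input check (Minimizer) and an output check (Sanitizer).
# The regex filter below is a placeholder for a real sanitization model.

INJECTION_PATTERNS = [
    re.compile(p, re.IGNORECASE)
    for p in (
        r"ignore (all )?(previous|prior) instructions",
        r"you must now",
        r"system prompt",
    )
]

def minimize_tool_input(args: dict, allowed_fields: set) -> dict:
    """Tool-Input Firewall (Minimizer): forward only fields the tool's
    schema declares, dropping anything beyond what the task needs."""
    return {k: v for k, v in args.items() if k in allowed_fields}

def sanitize_tool_output(text: str) -> str:
    """Tool-Output Firewall (Sanitizer): drop lines that look like
    embedded instructions before the output re-enters the agent's context."""
    clean_lines = [
        line for line in text.splitlines()
        if not any(p.search(line) for p in INJECTION_PATTERNS)
    ]
    return "\n".join(clean_lines)

def call_tool(tool, args: dict, allowed_fields: set) -> str:
    """Wrap a single tool call with both firewalls."""
    safe_args = minimize_tool_input(args, allowed_fields)
    raw_output = tool(**safe_args)
    return sanitize_tool_output(raw_output)
```

Because both firewalls sit outside the agent, the wrapper makes no assumptions about the underlying model, which is what makes this kind of defense modular and deployable out of the box.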
Key Contributions
- Modular, model-agnostic two-component firewall (Tool-Input Minimizer + Tool-Output Sanitizer) achieving 0% or lowest possible attack success rate across AgentDojo, Agent Security Bench, InjecAgent, and tau-Bench
- Systematic analysis exposing critical flaws in existing agentic security benchmarks — including flawed success metrics, implementation bugs, and weak attacks — with targeted fixes for two benchmarks
- Empirical demonstration that the proposed firewalls remain bypassable in practice by stronger adaptive attacks, motivating better benchmark design with more robust adversarial evaluation