Indirect Prompt Injections: Are Firewalls All You Need, or Stronger Benchmarks?
Rishika Bhagwatkar 1,2,3, Kevin Kasa 1,4, Abhay Puri 1, Gabriel Huang 1, Irina Rish 2,3, Graham W. Taylor 5,4, Krishnamurthy Dj Dvijotham 1, Alexandre Lacoste 1
Published on arXiv: 2510.05244
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
The firewall defense achieves perfect or near-perfect security (0% or the lowest possible attack success rate) while maintaining high utility across four benchmarks, but it remains bypassable by adaptive attacks absent from current benchmarks
Novel Technique
Tool-Input Firewall (Minimizer) + Tool-Output Firewall (Sanitizer)
AI agents are vulnerable to indirect prompt injection attacks, in which malicious instructions embedded in external content or tool outputs cause unintended or harmful behavior. Inspired by the well-established concept of firewalls, we show that a simple, modular, and model-agnostic defense operating at the agent-tool interface achieves perfect security (0% or the lowest possible attack success rate) with high utility (task success rate) across four public benchmarks: AgentDojo, Agent Security Bench, InjecAgent, and tau-Bench, achieving a state-of-the-art security-utility tradeoff compared to prior results. Specifically, we employ a defense based on two firewalls: a Tool-Input Firewall (Minimizer) and a Tool-Output Firewall (Sanitizer). Unlike prior, more complex approaches, this firewall defense makes minimal assumptions about the agent and can be deployed out of the box while maintaining strong performance without compromising utility. However, our analysis also reveals critical limitations in these existing benchmarks, including flawed success metrics, implementation bugs, and, most importantly, weak attacks, all of which hinder meaningful progress in the field. To foster such progress, we present targeted fixes to these issues for AgentDojo and Agent Security Bench and propose best practices for more robust benchmark design. Further, we demonstrate that although these firewalls push the state of the art on existing benchmarks, they can still be bypassed in practice, underscoring the need to incorporate stronger attacks into security benchmarks. Overall, our work shows that existing agentic security benchmarks are easily saturated by a simple approach, and it highlights the need for stronger benchmarks with carefully chosen evaluation metrics and strong adaptive attacks.
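To make the agent-tool interface placement concrete, here is a minimal sketch of the two-firewall pattern the abstract describes. All names are hypothetical, and the rule-based filter stands in for whatever the paper's actual Minimizer and Sanitizer use internally (e.g., an LLM-based rewriter); it is an illustration of the wrapping pattern, not the authors' implementation.

```python
import re

# Hypothetical sketch of the two-firewall pattern: every tool call is
# wrapped by an input check (Minimizer) and an output check (Sanitizer).
# The regex filter below is a placeholder for a real sanitization model.

INJECTION_PATTERNS = [
    re.compile(p, re.IGNORECASE)
    for p in (
        r"ignore (all )?(previous|prior) instructions",
        r"you must now",
        r"system prompt",
    )
]

def minimize_tool_input(args: dict, allowed_fields: set) -> dict:
    """Tool-Input Firewall (Minimizer): forward only fields the tool's
    schema declares, dropping anything beyond what the task needs."""
    return {k: v for k, v in args.items() if k in allowed_fields}

def sanitize_tool_output(text: str) -> str:
    """Tool-Output Firewall (Sanitizer): drop lines that look like
    embedded instructions before the output re-enters the agent's context."""
    clean_lines = [
        line for line in text.splitlines()
        if not any(p.search(line) for p in INJECTION_PATTERNS)
    ]
    return "\n".join(clean_lines)

def call_tool(tool, args: dict, allowed_fields: set) -> str:
    """Wrap a single tool call with both firewalls."""
    safe_args = minimize_tool_input(args, allowed_fields)
    raw_output = tool(**safe_args)
    return sanitize_tool_output(raw_output)
```

Because both firewalls sit outside the agent, the wrapper makes no assumptions about the underlying model, which is what makes this kind of defense modular and deployable out of the box.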
Key Contributions
- Modular, model-agnostic two-component firewall (Tool-Input Minimizer + Tool-Output Sanitizer) achieving 0% or lowest possible attack success rate across AgentDojo, Agent Security Bench, InjecAgent, and tau-Bench
- Systematic analysis exposing critical flaws in existing agentic security benchmarks — including flawed success metrics, implementation bugs, and weak attacks — with targeted fixes for two benchmarks
- Empirical demonstration that the proposed firewalls remain bypassable in practice by stronger adaptive attacks, motivating better benchmark design with more robust adversarial evaluation