benchmark 2026

Your Agent is More Brittle Than You Think: Uncovering Indirect Injection Vulnerabilities in Agentic LLMs

0 citations

Published on arXiv

2604.03870

Prompt Injection

OWASP LLM Top 10 — LLM01

Insecure Plugin Design

OWASP LLM Top 10 — LLM07

Excessive Agency

OWASP LLM Top 10 — LLM08

Key Finding

Advanced IPI attacks bypass nearly all baseline defenses; RepE-based detection intercepts unauthorized actions before execution with high accuracy across diverse LLM backbones

RepE-based circuit breaker

Novel technique introduced

The rapid deployment of open-source frameworks has significantly advanced the development of modern multi-agent systems. However, expanded action spaces, including uncontrolled privilege exposure and hidden inter-system interactions, pose severe security challenges. Specifically, Indirect Prompt Injections (IPI), which conceal malicious instructions within third-party content, can trigger unauthorized actions such as data exfiltration during normal operations. While current security evaluations predominantly rely on isolated single-turn benchmarks, the systemic vulnerabilities of these agents within complex dynamic environments remain critically underexplored. To bridge this gap, we systematically evaluate six defense strategies against four sophisticated IPI attack vectors across nine LLM backbones. Crucially, we conduct our evaluation entirely within dynamic multi-step tool-calling environments to capture the true attack surface of modern autonomous agents. Moving beyond binary success rates, our multidimensional analysis reveals a pronounced fragility. Advanced injections successfully bypass nearly all baseline defenses, and some surface-level mitigations even produce counterproductive side effects. Furthermore, while agents execute malicious instructions almost instantaneously, their internal states exhibit abnormally high decision entropy. Motivated by this latent hesitation, we investigate Representation Engineering (RepE) as a robust detection strategy. By extracting hidden states at the tool-input position, we revealed that the RepE-based circuit breaker successfully identifies and intercepts unauthorized actions before the agent commits to them, achieving high detection accuracy across diverse LLM backbones. This study exposes the limitations of current IPI defenses and provides a highly practical paradigm for building resilient multi-agent architectures.

Key Contributions

Systematic evaluation of 6 defense strategies against 4 IPI attack vectors across 9 LLM backbones in dynamic multi-step tool-calling environments
Discovery that advanced injections bypass nearly all baseline defenses, with some defenses producing counterproductive side effects
Representation Engineering (RepE)-based circuit breaker that detects unauthorized actions via hidden state analysis at tool-input position, achieving high detection accuracy

🛡️ Threat Analysis

Details

Domains

nlpmultimodal

Model Types

llm

Threat Tags

black_boxinference_timetargeted

Applications

autonomous agentsmulti-agent systemstool-calling agentsagentic workflows

Read PDF arXiv

Your Agent is More Brittle Than You Think: Uncovering Indirect Injection Vulnerabilities in Agentic LLMs

Key Contributions

🛡️ Threat Analysis

Details

Similar Papers

SoK: The Attack Surface of Agentic AI -- Tools, and Autonomy

Penetration Testing of Agentic AI: A Comparative Security Analysis Across Models and Frameworks

How Vulnerable Are AI Agents to Indirect Prompt Injections? Insights from a Large-Scale Public Competition

ClawSafety: "Safe" LLMs, Unsafe Agents

Evaluating Privilege Usage of Agents on Real-World Tools

From Assistant to Double Agent: Formalizing and Benchmarking Attacks on OpenClaw for Personalized Local AI Agent

Agentic AI Security: Threats, Defenses, Evaluation, and Open Challenges

A Safety and Security Framework for Real-World Agentic Systems