Agent-Fence: Mapping Security Vulnerabilities Across Deep Research Agents
Sai Puppala¹, Ismail Hossain², Md Jahangir Alam², Yoonpyo Lee³, Jay Yoo⁴, Tanzim Ahad², Syed Bahauddin Alam⁴, Sajedul Talukder²
Published on arXiv
2602.07652
Excessive Agency
OWASP LLM Top 10 — LLM08
Insecure Plugin Design
OWASP LLM Top 10 — LLM07
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
Mean security break rate varies substantially with architecture, from 0.29 (LangGraph) to 0.51 (AutoGPT); Denial-of-Wallet (0.62) and Authorization Confusion (0.54) are the highest-risk attack classes, far exceeding prompt-centric attacks (<0.20).
AgentFence
Novel technique introduced
Large language models are increasingly deployed as *deep agents* that plan, maintain persistent state, and invoke external tools, shifting safety failures from unsafe text to unsafe *trajectories*. We introduce **AgentFence**, an architecture-centric security evaluation that defines 14 trust-boundary attack classes spanning planning, memory, retrieval, tool use, and delegation, and detects failures via *trace-auditable conversation breaks* (unauthorized or unsafe tool use, wrong-principal actions, state/objective integrity violations, and attack-linked deviations). Holding the base model fixed, we evaluate eight agent archetypes under persistent multi-turn interaction and observe substantial architectural variation in mean security break rate (MSBR), ranging from $0.29 \pm 0.04$ (LangGraph) to $0.51 \pm 0.07$ (AutoGPT). The highest-risk classes are operational: Denial-of-Wallet ($0.62 \pm 0.08$), Authorization Confusion ($0.54 \pm 0.10$), Retrieval Poisoning ($0.47 \pm 0.09$), and Planning Manipulation ($0.44 \pm 0.11$), while prompt-centric classes remain below $0.20$ under standard settings. Breaks are dominated by boundary violations (SIV 31%, WPA 27%, UTI+UTA 24%, ATD 18%), and authorization confusion correlates with objective and tool hijacking ($\rho \approx 0.63$ and $\rho \approx 0.58$). AgentFence reframes agent security around what matters operationally: whether an agent stays within its goal and authority envelope over time.
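To make the MSBR numbers concrete, here is a minimal sketch of how a per-archetype break rate could be computed from binary trial outcomes. The function name, archetype labels, and outcome data below are illustrative assumptions, not the paper's actual harness or results.

```python
# Hypothetical sketch: mean security break rate (MSBR) per agent archetype.
# Each trial outcome is 1 if the trace audit flagged a conversation break, else 0.

def msbr(outcomes):
    """Fraction of evaluation trials ending in a trace-auditable break."""
    return sum(outcomes) / len(outcomes)

# Invented example outcomes (NOT the paper's data) for two archetypes.
runs = {
    "LangGraph": [0, 0, 1, 0, 0, 1, 0, 0, 0, 1],
    "AutoGPT":   [1, 0, 1, 1, 0, 1, 0, 1, 0, 0],
}

for archetype, outcomes in runs.items():
    print(f"{archetype}: MSBR = {msbr(outcomes):.2f}")
```

In the paper's protocol the base model is held fixed across archetypes, so differences in this statistic are attributable to control flow, memory handling, tool permissions, and delegation rather than to the model itself.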
Key Contributions
- Unified taxonomy of 14 trust-boundary attack classes covering prompt/state injection, planning manipulation, tool hijacking, retrieval poisoning, delegation abuse, authorization confusion, and denial-of-wallet for deep research agents
- AgentFence evaluation protocol that isolates architectural vulnerability by holding the base model fixed while varying control flow, memory handling, tool permissions, and delegation across eight agent archetypes
- Formalization of trace-auditable 'conversation breaks' as objective, outcome-based security criteria enabling evaluation of unsafe trajectories beyond text-only policy judgments
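The outcome-based criterion in the last contribution can be illustrated with a toy trace auditor: each step of an agent's tool-call trace is checked against a declared goal/authority envelope, and violations are recorded as breaks rather than judged from the agent's text. The record shapes, field names, and violation labels below are assumptions for illustration, not the paper's formalization.

```python
# Hypothetical sketch: audit an agent trajectory for trace-auditable breaks.
# A break is an objective, outcome-level violation of the agent's authority
# envelope (allowed tools, acting principal), independent of the emitted text.

def audit_trace(trace, allowed_tools, principal):
    """Return a list of (violation_kind, step) pairs found in the trace."""
    breaks = []
    for step in trace:
        if step["tool"] not in allowed_tools:
            # Unauthorized tool invocation (UTI/UTA-style violation).
            breaks.append(("unauthorized_tool", step))
        elif step["actor"] != principal:
            # Action taken on behalf of the wrong principal (WPA-style).
            breaks.append(("wrong_principal", step))
    return breaks

# Invented example trace: a shell call outside the envelope and a delegated
# sub-agent acting as the wrong principal both register as breaks.
trace = [
    {"tool": "search", "actor": "agent"},
    {"tool": "shell",  "actor": "agent"},
    {"tool": "search", "actor": "subagent"},
]
print(audit_trace(trace, allowed_tools={"search"}, principal="agent"))
```

The point of the sketch is the evaluation stance: because breaks are defined over the recorded trajectory, they can be verified mechanically after the fact, which is what distinguishes this criterion from text-only policy judgments.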