benchmark 2026

Agent-Fence: Mapping Security Vulnerabilities Across Deep Research Agents

Sai Puppala 1, Ismail Hossain 2, Md Jahangir Alam 2, Yoonpyo Lee 3, Jay Yoo 4, Tanzim Ahad 2, Syed Bahauddin Alam 4, Sajedul Talukder 2

0 citations · 35 references · arXiv (Cornell University)


Published on arXiv · 2602.07652

  • Excessive Agency · OWASP LLM Top 10, LLM08
  • Insecure Plugin Design · OWASP LLM Top 10, LLM07
  • Prompt Injection · OWASP LLM Top 10, LLM01

Key Finding

Architectural variation in mean security break rate spans 0.29 (LangGraph) to 0.51 (AutoGPT), with Denial-of-Wallet (0.62) and Authorization Confusion (0.54) as the highest-risk attack classes — far exceeding prompt-centric attacks (<0.20).

AgentFence

Novel technique introduced


Large language models are increasingly deployed as *deep agents* that plan, maintain persistent state, and invoke external tools, shifting safety failures from unsafe text to unsafe *trajectories*. We introduce **AgentFence**, an architecture-centric security evaluation that defines 14 trust-boundary attack classes spanning planning, memory, retrieval, tool use, and delegation, and detects failures via *trace-auditable conversation breaks* (unauthorized or unsafe tool use, wrong-principal actions, state/objective integrity violations, and attack-linked deviations). Holding the base model fixed, we evaluate eight agent archetypes under persistent multi-turn interaction and observe substantial architectural variation in mean security break rate (MSBR), ranging from $0.29 \pm 0.04$ (LangGraph) to $0.51 \pm 0.07$ (AutoGPT). The highest-risk classes are operational: Denial-of-Wallet ($0.62 \pm 0.08$), Authorization Confusion ($0.54 \pm 0.10$), Retrieval Poisoning ($0.47 \pm 0.09$), and Planning Manipulation ($0.44 \pm 0.11$), while prompt-centric classes remain below $0.20$ under standard settings. Breaks are dominated by boundary violations (SIV 31%, WPA 27%, UTI+UTA 24%, ATD 18%), and authorization confusion correlates with objective and tool hijacking ($\rho \approx 0.63$ and $\rho \approx 0.58$). AgentFence reframes agent security around what matters operationally: whether an agent stays within its goal and authority envelope over time.
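The headline MSBR numbers aggregate per-episode outcomes: an episode counts as a break if any trace-auditable violation occurred, and MSBR is the fraction of such episodes per architecture. A minimal sketch of that aggregation, with entirely illustrative outcome data (the paper's actual episode logs and helper names are not public):

```python
from statistics import mean

# Hypothetical per-episode outcomes: 1 if the trace contained at least one
# auditable conversation break, 0 otherwise. Values are illustrative only.
episodes = {
    "LangGraph": [0, 0, 1, 0, 0, 1, 0, 0, 0, 1],
    "AutoGPT":   [1, 0, 1, 1, 0, 1, 0, 1, 1, 0],
}

def msbr(outcomes):
    """Mean security break rate: fraction of episodes with a break."""
    return mean(outcomes)

for arch, outcomes in episodes.items():
    print(f"{arch}: MSBR = {msbr(outcomes):.2f}")
```

Because the criterion is outcome-based (did a boundary violation occur in the trace?), the same aggregation applies per attack class, which is how class-level rates like Denial-of-Wallet's 0.62 are reported.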


Key Contributions

  • Unified taxonomy of 14 trust-boundary attack classes covering prompt/state injection, planning manipulation, tool hijacking, retrieval poisoning, delegation abuse, authorization confusion, and denial-of-wallet for deep research agents
  • AgentFence evaluation protocol that isolates architectural vulnerability by holding the base model fixed while varying control flow, memory handling, tool permissions, and delegation across eight agent archetypes
  • Formalization of trace-auditable 'conversation breaks' as objective, outcome-based security criteria enabling evaluation of unsafe trajectories beyond text-only policy judgments

🛡️ Threat Analysis


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
inference_time, black_box
Datasets
LangGraph, AutoGPT
Applications
llm agents, autonomous research assistants, agentic ai systems, multi-agent orchestration frameworks