benchmark 2026

Agent-Fence: Mapping Security Vulnerabilities Across Deep Research Agents

Sai Puppala 1, Ismail Hossain 2, Md Jahangir Alam 2, Yoonpyo Lee 3, Jay Yoo 4, Tanzim Ahad 2, Syed Bahauddin Alam 4, Sajedul Talukder 2

0 citations · 35 references · arXiv (Cornell University)


Published on arXiv · 2602.07652

  • Excessive Agency · OWASP LLM Top 10, LLM08
  • Insecure Plugin Design · OWASP LLM Top 10, LLM07
  • Prompt Injection · OWASP LLM Top 10, LLM01

Key Finding

Architectural variation in mean security break rate spans 0.29 (LangGraph) to 0.51 (AutoGPT), with Denial-of-Wallet (0.62) and Authorization Confusion (0.54) as the highest-risk attack classes — far exceeding prompt-centric attacks (<0.20).

AgentFence

Novel technique introduced


Large language models are increasingly deployed as *deep agents* that plan, maintain persistent state, and invoke external tools, shifting safety failures from unsafe text to unsafe *trajectories*. We introduce **AgentFence**, an architecture-centric security evaluation that defines 14 trust-boundary attack classes spanning planning, memory, retrieval, tool use, and delegation, and detects failures via *trace-auditable conversation breaks* (unauthorized or unsafe tool use, wrong-principal actions, state/objective integrity violations, and attack-linked deviations). Holding the base model fixed, we evaluate eight agent archetypes under persistent multi-turn interaction and observe substantial architectural variation in mean security break rate (MSBR), ranging from $0.29 \pm 0.04$ (LangGraph) to $0.51 \pm 0.07$ (AutoGPT). The highest-risk classes are operational: Denial-of-Wallet ($0.62 \pm 0.08$), Authorization Confusion ($0.54 \pm 0.10$), Retrieval Poisoning ($0.47 \pm 0.09$), and Planning Manipulation ($0.44 \pm 0.11$), while prompt-centric classes remain below $0.20$ under standard settings. Breaks are dominated by boundary violations (SIV 31%, WPA 27%, UTI+UTA 24%, ATD 18%), and authorization confusion correlates with objective and tool hijacking ($\rho \approx 0.63$ and $\rho \approx 0.58$). AgentFence reframes agent security around what matters operationally: whether an agent stays within its goal and authority envelope over time.
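The headline MSBR numbers aggregate per-episode outcomes: an episode counts as a break if any trace-auditable violation occurred, and MSBR is the fraction of such episodes per architecture. A minimal sketch of that aggregation, with entirely illustrative outcome data (the paper's actual episode logs and helper names are not public):

```python
from statistics import mean

# Hypothetical per-episode outcomes: 1 if the trace contained at least one
# auditable conversation break, 0 otherwise. Values are illustrative only.
episodes = {
    "LangGraph": [0, 0, 1, 0, 0, 1, 0, 0, 0, 1],
    "AutoGPT":   [1, 0, 1, 1, 0, 1, 0, 1, 1, 0],
}

def msbr(outcomes):
    """Mean security break rate: fraction of episodes with a break."""
    return mean(outcomes)

for arch, outcomes in episodes.items():
    print(f"{arch}: MSBR = {msbr(outcomes):.2f}")
```

Because the criterion is outcome-based (did a boundary violation occur in the trace?), the same aggregation applies per attack class, which is how class-level rates like Denial-of-Wallet's 0.62 are reported.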


Key Contributions

  • Unified taxonomy of 14 trust-boundary attack classes covering prompt/state injection, planning manipulation, tool hijacking, retrieval poisoning, delegation abuse, authorization confusion, and denial-of-wallet for deep research agents
  • AgentFence evaluation protocol that isolates architectural vulnerability by holding the base model fixed while varying control flow, memory handling, tool permissions, and delegation across eight agent archetypes
  • Formalization of trace-auditable 'conversation breaks' as objective, outcome-based security criteria enabling evaluation of unsafe trajectories beyond text-only policy judgments

🛡️ Threat Analysis


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
inference_time, black_box
Datasets
LangGraph, AutoGPT
Applications
llm agents, autonomous research assistants, agentic ai systems, multi-agent orchestration frameworks