benchmark 2025

PEAR: Planner-Executor Agent Robustness Benchmark

Shen Dong ¹, Mingxuan Zhang ², Pengfei He ¹, Li Ma ¹, Bhavani Thuraisingham ³, Hui Liu ¹, Yue Xing ¹

¹ Michigan State University

² Purdue University

³ University of Texas at Dallas

0 citations · 49 references · arXiv

Published on arXiv

2510.07505

Prompt Injection

OWASP LLM Top 10 — LLM01

Excessive Agency

OWASP LLM Top 10 — LLM08

Key Finding

Attacks targeting the planner are significantly more effective than executor-targeted attacks, and stronger planner-executor pairs exhibit higher attack success rates due to their greater instruction-following capability without proportionally stronger safety mechanisms.

PEAR

Novel technique introduced

Large Language Model (LLM)-based Multi-Agent Systems (MAS) have emerged as a powerful paradigm for tackling complex, multi-step tasks across diverse domains. However, despite their impressive capabilities, MAS remain susceptible to adversarial manipulation. Existing studies typically examine isolated attack surfaces or specific scenarios, so a holistic understanding of MAS vulnerabilities is still missing. To bridge this gap, we introduce PEAR, a benchmark for systematically evaluating both the utility and vulnerability of planner-executor MAS. While compatible with various MAS architectures, our benchmark focuses on the planner-executor structure, which is a practical and widely adopted design. Through extensive experiments, we find that (1) a weak planner degrades overall clean task performance more severely than a weak executor; (2) while a memory module is essential for the planner, adding a memory module to the executor does not affect clean task performance; (3) there exists a trade-off between task performance and robustness; and (4) attacks targeting the planner are particularly effective at misleading the system. These findings offer actionable insights for enhancing the robustness of MAS and lay the groundwork for principled defenses in multi-agent settings.
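The planner-executor structure described above can be pictured with a minimal Python sketch. This is not PEAR's implementation: the `Agent` class, `run_task`, and the two toy policies are hypothetical names invented here, and the policies are hard-coded stand-ins for LLM calls. The sketch only shows the control flow (planner emits a subtask, executor carries it out) and where a per-agent memory module would sit.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Agent:
    """Toy agent (hypothetical): a policy function plus an optional memory."""
    policy: Callable[[str, List[str]], str]
    use_memory: bool = True
    memory: List[str] = field(default_factory=list)

    def step(self, message: str) -> str:
        # Without memory the policy sees an empty history on every call.
        history = list(self.memory) if self.use_memory else []
        output = self.policy(message, history)
        if self.use_memory:
            self.memory.append(output)
        return output


def run_task(planner: Agent, executor: Agent, task: str, max_steps: int = 3) -> List[str]:
    """Alternate planner and executor until the planner says DONE or steps run out."""
    results: List[str] = []
    for _ in range(max_steps):
        subtask = planner.step(task)
        if subtask == "DONE":
            break
        results.append(executor.step(subtask))
    return results


def toy_planner(task: str, history: List[str]) -> str:
    # Assumed stand-in for an LLM planner: picks the next step from its own history.
    steps = ["open page", "fill form"]
    return steps[len(history)] if len(history) < len(steps) else "DONE"


def toy_executor(subtask: str, history: List[str]) -> str:
    # Assumed stand-in for an LLM executor: acts on the subtask alone.
    return f"executed: {subtask}"
```

In this toy setup, a planner built with `use_memory=False` keeps re-issuing the first subtask until `max_steps` is reached, while disabling the executor's memory changes nothing. That mirrors finding (2) only by construction of the toy policies and is not evidence for it.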

Key Contributions

PEAR benchmark with 84 clean tasks and 120 attack tasks spanning harmful action, privacy leakage, and resource exhaustion across four real-world scenarios (1,680 attack instances total)
Systematic evaluation of five attack types across two attack surfaces (planner and executor) in planner-executor MAS
Empirical finding that attacks targeting the planner are most effective and that a performance-robustness trade-off exists in MAS
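Comparing attack surfaces comes down to an attack success rate (ASR) per surface. The following is a minimal sketch of that bookkeeping, assuming each attack instance is logged as a `(surface, succeeded)` pair; the function name and record layout are assumptions made here, not PEAR's actual evaluation code or data format.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def attack_success_rate(records: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Return the fraction of successful attacks for each attack surface.

    `records` is an assumed log format: (surface name, whether the attack succeeded).
    """
    totals: Dict[str, int] = defaultdict(int)
    successes: Dict[str, int] = defaultdict(int)
    for surface, succeeded in records:
        totals[surface] += 1
        successes[surface] += int(succeeded)
    return {surface: successes[surface] / totals[surface] for surface in totals}


# Made-up illustrative records, not results from the paper.
example_records = [
    ("planner", True), ("planner", True), ("planner", True), ("planner", False),
    ("executor", True), ("executor", False), ("executor", False), ("executor", False),
]
```

Calling `attack_success_rate(example_records)` on the invented records above gives 0.75 for the planner and 0.25 for the executor. The same grouping could be keyed by attack type or by planner-executor model pair instead of by surface.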

🛡️ Threat Analysis

Details

Domains

nlp

Model Types

llm

Threat Tags

black_box, inference_time, targeted

Datasets

PEAR (introduced in paper)

Applications

multi-agent systems, llm agents, web automation, os control


Similar Papers

Mind the Gap: Comparing Model- vs Agentic-Level Red Teaming with Action-Graph Observability on GPT-OSS-20B

You Told Me to Do It: Measuring Instructional Text-induced Private Data Leakage in LLM Agents

It's a TRAP! Task-Redirecting Agent Persuasion Benchmark for Web Agents

FinVault: Benchmarking Financial Agent Safety in Execution-Grounded Environments

Mind the Gap: Evaluating Model- and Agentic-Level Vulnerabilities in LLMs with Action Graphs

WebTrap Park: An Automated Platform for Systematic Security Evaluation of Web Agents

When Hallucination Costs Millions: Benchmarking AI Agents in High-Stakes Adversarial Financial Markets

The Silicon Psyche: Anthropomorphic Vulnerabilities in Large Language Models