benchmark · 2026

LPS-Bench: Benchmarking Safety Awareness of Computer-Use Agents in Long-Horizon Planning under Benign and Adversarial Scenarios

Tianyu Chen 1, Chujia Hu 1, Ge Gao 1, Dongrui Liu 2, Xia Hu 2, Wenjie Wang 1

1 citations · 1 influential · 48 references · arXiv (Cornell University)


Published on arXiv · 2602.03255

Insecure Plugin Design (OWASP LLM Top 10 — LLM07)

Excessive Agency (OWASP LLM Top 10 — LLM08)

Key Finding

Experiments across 13 LLM agents reveal substantial deficiencies in maintaining safe behavior during long-horizon MCP-based workflows under both benign and adversarial conditions.

LPS-Bench

Novel technique introduced


Computer-use agents (CUAs) that interact with real computer systems can perform automated tasks but face critical safety risks. Ambiguous instructions may trigger harmful actions, and adversarial users can manipulate tool execution to achieve malicious goals. Existing benchmarks mostly focus on short-horizon or GUI-based tasks, evaluating execution-time errors but overlooking the ability to anticipate planning-time risks. To fill this gap, we present LPS-Bench, a benchmark that evaluates the planning-time safety awareness of MCP-based CUAs on long-horizon tasks, covering both benign and adversarial interactions across 65 scenarios spanning 7 task domains and 9 risk types. We introduce a multi-agent automated pipeline for scalable data generation and adopt an LLM-as-a-judge evaluation protocol to assess safety awareness over the full planning trajectory. Experiments reveal substantial deficiencies in existing CUAs' ability to maintain safe behavior. We further analyze the risks and propose mitigation strategies to improve long-horizon planning safety in MCP-based CUA systems. We open-source our code at https://github.com/tychenn/LPS-Bench.


Key Contributions

  • LPS-Bench: a benchmark of 65 scenarios across 7 task domains and 9 risk types evaluating planning-time safety awareness of MCP-based CUAs under both benign and adversarial long-horizon settings
  • Multi-agent automated pipeline for scalable safety scenario generation with LLM-as-a-judge evaluation across full planning trajectories
  • Empirical evaluation of 13 LLM agents revealing substantial safety deficiencies, with analysis of risk categories and proposed mitigation strategies
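The trajectory-level evaluation idea can be illustrated with a minimal sketch. This is a hypothetical stand-in, not LPS-Bench's actual protocol: a real LLM-as-a-judge pass would prompt a judge model with the agent's full planning trajectory and a safety rubric, whereas here a toy rule-based judge flags risky steps (the `PlanStep` structure, `RISK_CUES` list, and scoring rule are all illustrative assumptions).

```python
from dataclasses import dataclass

@dataclass
class PlanStep:
    tool: str       # MCP tool name the agent plans to call
    args: dict      # planned tool arguments
    rationale: str  # agent's stated reasoning for this step

# Hypothetical risk cues a judge prompt might ask an LLM to look for;
# the benchmark's real rubric covers 9 risk types and is richer than this.
RISK_CUES = {"delete", "rm -rf", "transfer", "exfiltrate", "disable_logging"}

def judge_trajectory(steps: list[PlanStep]) -> dict:
    """Toy stand-in for an LLM-as-a-judge pass over a planning trajectory:
    flag steps whose tool call or rationale matches a risk cue, and check
    whether the agent paused to confirm before any risky action."""
    flagged = []
    for i, step in enumerate(steps):
        text = f"{step.tool} {step.args} {step.rationale}".lower()
        if any(cue in text for cue in RISK_CUES):
            flagged.append(i)
    asked_confirmation = any("confirm" in s.rationale.lower() for s in steps)
    return {
        "flagged_steps": flagged,
        "asked_confirmation": asked_confirmation,
        "safe": not flagged or asked_confirmation,
    }

# A short benign-looking trajectory containing a destructive step
# the agent never asked the user to confirm.
trajectory = [
    PlanStep("fs.list", {"path": "/home"}, "enumerate files"),
    PlanStep("fs.delete", {"path": "/home/reports"}, "delete old reports as asked"),
]
print(judge_trajectory(trajectory))
```

The key design point this mirrors is that the judge scores the whole planning trajectory, not just the final action, so an agent that plans a risky step without seeking confirmation is penalized even before execution.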

🛡️ Threat Analysis


Details

Domains
nlp
Model Types
llm
Threat Tags
inference_time, black_box
Datasets
LPS-Bench (65 scenarios, 7 task domains, 9 risk types)
Applications
computer-use agents, MCP-based autonomous agents, long-horizon task planning systems