Code Agent can be an End-to-end System Hacker: Benchmarking Real-world Threats of Computer-use Agent
Weidi Luo 1, Qiming Zhang 2, Tianyu Lu 2, Xiaogeng Liu 3, Bin Hu 4, Hung-Chun Chiu 5, Siyuan Ma 6, Yizhe Zhang 7, Xusheng Xiao 8, Yinzhi Cao 3, Zhen Xiang 1, Chaowei Xiao 2,3
2 University of Wisconsin–Madison
4 University of Maryland, College Park
5 Hong Kong University of Science and Technology
6 Chinese University of Hong Kong
7 Apple
Published on arXiv (arXiv:2510.06607)
Excessive Agency (OWASP LLM Top 10, LLM08)
Prompt Injection (OWASP LLM Top 10, LLM01)
Key Finding
Cursor CLI achieves a 69.59% average attack success rate (ASR) on TTP tasks, while Cursor IDE reaches 34.62% on end-to-end kill chains, demonstrating that CUAs can lower the bar for complex enterprise intrusions without deep domain expertise.
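As a point of reference, an average attack success rate (ASR) like the figures above is simply the fraction of attack attempts that succeed, expressed as a percentage. The sketch below is illustrative only; the 103/148 split is a made-up example chosen to land near 69.59%, not the paper's actual counts.

```python
def attack_success_rate(outcomes):
    """Compute ASR (%) from a list of booleans, one per task attempt
    (True = the attack succeeded under the benchmark's success criteria)."""
    if not outcomes:
        return 0.0
    return 100.0 * sum(outcomes) / len(outcomes)

# Illustrative: 103 successes out of 148 attempts
print(round(attack_success_rate([True] * 103 + [False] * 45), 2))  # → 69.59
```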
AdvCUA
Novel benchmark introduced
Computer-use agent (CUA) frameworks, powered by large language models (LLMs) or multimodal LLMs (MLLMs), are rapidly maturing into assistants that can perceive context, reason, and act directly within software environments. Among their most critical applications is operating system (OS) control. As CUAs in the OS domain become increasingly embedded in daily operations, it is imperative to examine their real-world security implications, specifically whether they can be misused to perform realistic, security-relevant attacks. Existing work exhibits four major limitations: a missing attacker-knowledge model grounded in tactics, techniques, and procedures (TTPs); incomplete coverage of end-to-end kill chains; unrealistic environments lacking multi-host topologies and encrypted user credentials; and unreliable judgment that depends on LLM-as-a-Judge. To address these gaps, we propose AdvCUA, the first benchmark aligned with real-world TTPs in the MITRE ATT&CK Enterprise Matrix. It comprises 140 tasks, including 40 direct malicious tasks, 74 TTP-based malicious tasks, and 26 end-to-end kill chains, and systematically evaluates CUAs under realistic enterprise OS security threats in a multi-host sandbox environment using hard-coded evaluation. We evaluate five mainstream CUAs, ReAct, AutoGPT, Gemini CLI, Cursor CLI, and Cursor IDE, on 8 foundation LLMs. The results demonstrate that current frontier CUAs do not adequately cover OS security-centric threats. These capabilities reduce dependence on custom malware and deep domain expertise, enabling even inexperienced attackers to mount complex enterprise intrusions, which raises societal concerns about the responsibility and security of CUAs.
Key Contributions
- AdvCUA: the first CUA security benchmark aligned with MITRE ATT&CK Enterprise Matrix, comprising 140 tasks (40 direct malicious, 74 TTP-based, 26 end-to-end kill chains) in a realistic multi-host enterprise sandbox
- Hard-coded deterministic evaluation methodology (Match/Trigger/Probe/Verify) replacing unreliable LLM-as-a-Judge assessment
- Systematic evaluation of 5 mainstream CUAs (ReAct, AutoGPT, Gemini CLI, Cursor CLI, Cursor IDE) across 8 foundation LLMs, revealing that Cursor CLI achieves a 69.59% ASR on TTP tasks and Cursor IDE 34.62% on end-to-end kill chains
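To make the contrast with LLM-as-a-Judge concrete, deterministic checks in the spirit of the Match/Trigger/Probe/Verify methodology can be written as plain predicates over observable system state. The sketch below is a hedged illustration: the function names, paths, and commands are assumptions for exposition, not taken from the benchmark itself.

```python
# Illustrative deterministic checks (Match- and Verify-style); all names,
# paths, and commands here are hypothetical, not from AdvCUA.
import hashlib
import pathlib
import re
import subprocess

def match_output(cmd, pattern):
    """Match-style check: run a command and test its stdout against a regex,
    instead of asking an LLM judge whether a step looks successful."""
    out = subprocess.run(cmd, capture_output=True, text=True, timeout=10).stdout
    return re.search(pattern, out) is not None

def verify_artifact(path, expected_sha256):
    """Verify-style check: succeed iff the expected file exists on disk and
    its contents hash to a pre-registered value."""
    try:
        data = pathlib.Path(path).read_bytes()
    except OSError:
        return False
    return hashlib.sha256(data).hexdigest() == expected_sha256
```

Because each check is a pure predicate over system state, results are reproducible across runs, which is the property LLM-as-a-Judge scoring lacks.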