benchmark 2026

Too Helpful to Be Safe: User-Mediated Attacks on Planning and Web-Use Agents

Fengchao Chen ^1,2, Tingmin Wu ², Van Nguyen ¹, Carsten Rudolph ¹

¹ Monash University

² CSIRO’s Data61

2 citations · 31 references · arXiv

Published on arXiv

2601.10758

Prompt Injection

OWASP LLM Top 10 — LLM01

Excessive Agency

OWASP LLM Top 10 — LLM08

Key Finding

Without explicit safety prompting, trip-planning agents bypass constraints in over 92% of cases and 9 of 17 web-use agent tests reach a 100% bypass rate, showing helpfulness-safety misalignment is the primary vulnerability.

user-mediated attacks

Novel technique introduced

Large Language Models (LLMs) have enabled agents to move beyond conversation toward end-to-end task execution and become more helpful. However, this helpfulness introduces new security risks stem less from direct interface abuse than from acting on user-provided content. Existing studies on agent security largely focus on model-internal vulnerabilities or adversarial access to agent interfaces, overlooking attacks that exploit users as unintended conduits. In this paper, we study user-mediated attacks, where benign users are tricked into relaying untrusted or attacker-controlled content to agents, and analyze how commercial LLM agents respond under such conditions. We conduct a systematic evaluation of 12 commercial agents in a sandboxed environment, covering 6 trip-planning agents and 6 web-use agents, and compare agent behavior across scenarios with no, soft, and hard user-requested safety checks. Our results show that agents are too helpful to be safe by default. Without explicit safety requests, trip-planning agents bypass safety constraints in over 92% of cases, converting unverified content into confident booking guidance. Web-use agents exhibit near-deterministic execution of risky actions, with 9 out of 17 supported tests reaching a 100% bypass rate. Even when users express soft or hard safety intent, constraint bypass remains substantial, reaching up to 54.7% and 7% for trip-planning agents, respectively. These findings reveal that the primary issue is not a lack of safety capability, but its prioritization. Agents invoke safety checks only conditionally when explicitly prompted, and otherwise default to goal-driven execution. Moreover, agents lack clear task boundaries and stopping rules, frequently over-executing workflows in ways that lead to unnecessary data disclosure and real-world harm.

Key Contributions

Introduces 'user-mediated attacks' as a novel threat paradigm where benign users serve as unintended conduits for attacker-controlled content to LLM agents, bypassing attacker-in-the-loop assumptions
Systematic sandboxed evaluation of 12 commercial LLM agents (6 trip-planning, 6 web-use) across no/soft/hard safety check conditions, revealing 92%+ safety bypass by default
Identifies that agents possess safety capabilities but fail to prioritize them by default, and lack task boundary enforcement, leading to over-execution and unnecessary data disclosure

🛡️ Threat Analysis

Details

Domains

nlp

Model Types

llm

Threat Tags

black_boxinference_time

Applications

trip-planning agentsweb-use agentsllm agentic systems

Read PDF arXiv DOI

Too Helpful to Be Safe: User-Mediated Attacks on Planning and Web-Use Agents

Key Contributions

🛡️ Threat Analysis

Details

Similar Papers

ASTRA: Agentic Steerability and Risk Assessment Framework

Mind the Gap: Evaluating Model- and Agentic-Level Vulnerabilities in LLMs with Action Graphs

Helpful to a Fault: Measuring Illicit Assistance in Multi-Turn, Multilingual LLM Agents

The Silicon Psyche: Anthropomorphic Vulnerabilities in Large Language Models

CIBER: A Comprehensive Benchmark for Security Evaluation of Code Interpreter Agents

When Hallucination Costs Millions: Benchmarking AI Agents in High-Stakes Adversarial Financial Markets

Mind the GAP: Text Safety Does Not Transfer to Tool-Call Safety in LLM Agents

WebTrap Park: An Automated Platform for Systematic Security Evaluation of Web Agents