benchmark 2026

Too Helpful to Be Safe: User-Mediated Attacks on Planning and Web-Use Agents

Fengchao Chen 1,2, Tingmin Wu 2, Van Nguyen 1, Carsten Rudolph 1

2 citations · 31 references · arXiv

α

Published on arXiv

2601.10758

Prompt Injection

OWASP LLM Top 10 — LLM01

Excessive Agency

OWASP LLM Top 10 — LLM08

Key Finding

Without explicit safety prompting, trip-planning agents bypass constraints in over 92% of cases and 9 of 17 web-use agent tests reach a 100% bypass rate, showing helpfulness-safety misalignment is the primary vulnerability.

user-mediated attacks

Novel technique introduced


Large Language Models (LLMs) have enabled agents to move beyond conversation toward end-to-end task execution and become more helpful. However, this helpfulness introduces new security risks stem less from direct interface abuse than from acting on user-provided content. Existing studies on agent security largely focus on model-internal vulnerabilities or adversarial access to agent interfaces, overlooking attacks that exploit users as unintended conduits. In this paper, we study user-mediated attacks, where benign users are tricked into relaying untrusted or attacker-controlled content to agents, and analyze how commercial LLM agents respond under such conditions. We conduct a systematic evaluation of 12 commercial agents in a sandboxed environment, covering 6 trip-planning agents and 6 web-use agents, and compare agent behavior across scenarios with no, soft, and hard user-requested safety checks. Our results show that agents are too helpful to be safe by default. Without explicit safety requests, trip-planning agents bypass safety constraints in over 92% of cases, converting unverified content into confident booking guidance. Web-use agents exhibit near-deterministic execution of risky actions, with 9 out of 17 supported tests reaching a 100% bypass rate. Even when users express soft or hard safety intent, constraint bypass remains substantial, reaching up to 54.7% and 7% for trip-planning agents, respectively. These findings reveal that the primary issue is not a lack of safety capability, but its prioritization. Agents invoke safety checks only conditionally when explicitly prompted, and otherwise default to goal-driven execution. Moreover, agents lack clear task boundaries and stopping rules, frequently over-executing workflows in ways that lead to unnecessary data disclosure and real-world harm.


Key Contributions

  • Introduces 'user-mediated attacks' as a novel threat paradigm where benign users serve as unintended conduits for attacker-controlled content to LLM agents, bypassing attacker-in-the-loop assumptions
  • Systematic sandboxed evaluation of 12 commercial LLM agents (6 trip-planning, 6 web-use) across no/soft/hard safety check conditions, revealing 92%+ safety bypass by default
  • Identifies that agents possess safety capabilities but fail to prioritize them by default, and lack task boundary enforcement, leading to over-execution and unnecessary data disclosure

🛡️ Threat Analysis


Details

Domains
nlp
Model Types
llm
Threat Tags
black_boxinference_time
Applications
trip-planning agentsweb-use agentsllm agentic systems