benchmark 2025

Beyond Jailbreak: Unveiling Risks in LLM Applications Arising from Blurred Capability Boundaries

Yunyi Zhang ¹, Shibo Cui ¹, Baojun Liu ¹, Jingkai Yu ¹, Min Zhang ², Fan Shi ²,³, Han Zheng

¹ Tsinghua University

² National University of Defense Technology

³ TrustAl Pte. Ltd.

0 citations · 44 references · arXiv

Published on arXiv

2511.17874

Excessive Agency

OWASP LLM Top 10 — LLM08

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

178 out of 199 (89.45%) popular LLM applications exhibit blurred capability boundaries, with 17 executing malicious tasks directly without any adversarial prompt rewriting
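
The headline percentage can be re-derived from the two counts quoted above. The snippet below is only an arithmetic check; the helper name `share` is an illustrative assumption, and the share for the 17 directly malicious apps is computed here rather than quoted from the paper.

```python
def share(count: int, total: int) -> float:
    """Return count/total as a percentage rounded to two decimals."""
    return round(100 * count / total, 2)

TOTAL_APPS = 199          # popular LLM apps evaluated (from the key finding)
AFFECTED_APPS = 178       # apps with blurred capability boundaries
DIRECT_MALICIOUS = 17     # apps executing malicious tasks without rewriting

affected_pct = share(AFFECTED_APPS, TOTAL_APPS)
direct_pct = share(DIRECT_MALICIOUS, TOTAL_APPS)

print(f"affected: {affected_pct}%")           # matches the reported 89.45%
print(f"directly malicious: {direct_pct}%")   # derived here, not reported above
```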

LLMApp-Eval

Novel technique introduced

LLM applications (i.e., LLM apps) leverage the powerful capabilities of LLMs to provide users with customized services, revolutionizing traditional application development. While the increasing prevalence of LLM-powered applications provides users with unprecedented convenience, it also brings new security challenges. The security community lacks sufficient understanding of this emerging ecosystem, especially regarding the capability boundaries of the applications themselves. In this paper, we systematically analyze the new development paradigm and define the concept of the LLM app capability space. We also uncover potential new risks beyond jailbreak that arise from ambiguous capability boundaries in real-world scenarios, namely capability downgrade and capability upgrade. To evaluate the impact of these risks, we design and implement an LLM app capability evaluation framework, LLMApp-Eval. First, we collect application metadata across 4 platforms and conduct a cross-platform ecosystem analysis. Then, we evaluate the risks for 199 popular applications across 4 platforms and 6 open-source LLMs. We identify 178 (89.45%) potentially affected applications, which can perform tasks from more than 15 scenarios or perform malicious tasks. We even find 17 applications in our study that execute malicious tasks directly, without any adversarial rewriting. Furthermore, our experiments reveal a positive correlation between the quality of prompt design and application robustness: well-designed prompts enhance security, while poorly designed ones can facilitate abuse. We hope our work inspires the community to focus on the real-world risks of LLM applications and fosters the development of a more robust LLM application ecosystem.

Key Contributions

Defines the concept of 'LLM app capability space' and formalizes two new risk types: capability downgrade (apps doing less than intended) and capability upgrade (apps doing more than intended)
Designs and implements LLMApp-Eval, a cross-platform evaluation framework that assesses real-world LLM apps for capability boundary violations across 4 platforms and 6 open-source LLMs
Conducts a large-scale empirical study of 199 popular LLM apps, revealing that 89.45% are potentially affected, including 17 apps that execute malicious tasks without any adversarial prompting, and finds a positive correlation between prompt design quality and application security
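
The two risk types in the first contribution can be illustrated with a small sketch. This is not the LLMApp-Eval implementation; it only assumes that each probe sent to an app is labeled with whether its scenario lies inside the app's intended capability space and whether the app completed it. The names `ProbeResult` and `classify_risks`, the risk labels, and the example scenarios are all hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, Set


@dataclass(frozen=True)
class ProbeResult:
    scenario: str            # task scenario used for the probe (hypothetical label)
    intended: bool           # True if the scenario is inside the app's intended capability space
    completed: bool          # True if the app actually carried out the task
    malicious: bool = False  # True if the probe asked for a malicious task


def classify_risks(results: Iterable[ProbeResult]) -> Set[str]:
    """Map probe outcomes to the risk types described above (assumed labels)."""
    risks: Set[str] = set()
    for r in results:
        if r.intended and not r.completed:
            # The app does less than intended.
            risks.add("capability_downgrade")
        if not r.intended and r.completed:
            # The app does more than intended.
            risks.add("capability_upgrade")
            if r.malicious:
                risks.add("malicious_execution")
    return risks


# Example: a hypothetical recipe assistant probed with three tasks.
probes = [
    ProbeResult("recipe suggestions", intended=True, completed=True),
    ProbeResult("code generation", intended=False, completed=True),
    ProbeResult("phishing email", intended=False, completed=True, malicious=True),
]
print(sorted(classify_risks(probes)))
```

Representing the capability space as a per-probe boolean keeps the sketch short; a fuller model would presumably track the set of scenarios each app is meant to cover.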

🛡️ Threat Analysis

Details

Domains

nlp

Model Types

llm

Threat Tags

black_box, inference_time

Datasets

199 LLM applications across 4 commercial platforms, 6 open-source LLMs

Applications

llm application platforms, ai-powered chatbot services, custom gpt/agent deployments

Similar Papers

WebTrap Park: An Automated Platform for Systematic Security Evaluation of Web Agents

Too Helpful to Be Safe: User-Mediated Attacks on Planning and Web-Use Agents

Helpful to a Fault: Measuring Illicit Assistance in Multi-Turn, Multilingual LLM Agents

ASTRA: Agentic Steerability and Risk Assessment Framework

Exposing Weak Links in Multi-Agent Systems under Adversarial Prompting

Mind the Gap: Evaluating Model- and Agentic-Level Vulnerabilities in LLMs with Action Graphs

AgentLAB: Benchmarking LLM Agents against Long-Horizon Attacks

Mind the GAP: Text Safety Does Not Transfer to Tool-Call Safety in LLM Agents