MCP-SafetyBench: A Benchmark for Safety Evaluation of Large Language Models with Real-World MCP Servers
Xuanjun Zong 1, Zhiqi Shen 2, Lei Wang 3, Yunshi Lan 1, Chao Yang 4
Published on arXiv: 2512.15163
Insecure Plugin Design
OWASP LLM Top 10 — LLM07
Excessive Agency
OWASP LLM Top 10 — LLM08
Key Finding
Safety vulnerabilities in LLM agents escalate significantly as task horizons lengthen and as workflows span more servers, and safety prompts alone offer limited, sometimes counterproductive, protection against MCP attacks.
MCP-SafetyBench
Novel technique introduced
Large language models (LLMs) are evolving into agentic systems that reason, plan, and operate external tools. The Model Context Protocol (MCP) is a key enabler of this transition, offering a standardized interface for connecting LLMs with heterogeneous tools and services. Yet MCP's openness and multi-server workflows introduce new safety risks that existing benchmarks fail to capture, as they focus on isolated attacks or lack real-world coverage. We present MCP-SafetyBench, a comprehensive benchmark built on real MCP servers that supports realistic multi-turn evaluation across five domains: browser automation, financial analysis, location navigation, repository management, and web search. It incorporates a unified taxonomy of 20 MCP attack types spanning server, host, and user sides, and includes tasks requiring multi-step reasoning and cross-server coordination under uncertainty. Using MCP-SafetyBench, we systematically evaluate leading open- and closed-source LLMs, revealing large disparities in safety performance and escalating vulnerabilities as task horizons and server interactions grow. Our results highlight the urgent need for stronger defenses and establish MCP-SafetyBench as a foundation for diagnosing and mitigating safety risks in real-world MCP deployments.
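To make the benchmark's stated dimensions concrete, the sketch below models a single test case as a small record. Only the five domain names and the three attack sides come from the abstract. The class name `BenchmarkCase`, its fields, the server names, and the attack label are hypothetical, chosen for illustration rather than taken from the benchmark's actual schema.

```python
from dataclasses import dataclass
from enum import Enum


class Domain(Enum):
    """The five evaluation domains named in the abstract."""
    BROWSER_AUTOMATION = "browser automation"
    FINANCIAL_ANALYSIS = "financial analysis"
    LOCATION_NAVIGATION = "location navigation"
    REPOSITORY_MANAGEMENT = "repository management"
    WEB_SEARCH = "web search"


class AttackSide(Enum):
    """The three attack surfaces covered by the taxonomy."""
    SERVER = "server"
    HOST = "host"
    USER = "user"


@dataclass(frozen=True)
class BenchmarkCase:
    # Hypothetical record layout; the real benchmark format may differ.
    task_id: str
    domain: Domain
    attack_side: AttackSide
    attack_type: str          # placeholder label, not one of the 20 real types
    servers: tuple[str, ...]  # MCP servers the task touches
    max_turns: int            # rough proxy for task horizon

    def is_cross_server(self) -> bool:
        """True when the task involves more than one distinct server."""
        return len(set(self.servers)) > 1


example = BenchmarkCase(
    task_id="demo-001",
    domain=Domain.REPOSITORY_MANAGEMENT,
    attack_side=AttackSide.SERVER,
    attack_type="placeholder-attack",
    servers=("repo-server", "search-server"),
    max_turns=6,
)
print(example.is_cross_server())  # True
```

Keeping the domain and attack side as enums means a mislabeled case fails at construction time rather than silently skewing a later aggregate.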
Key Contributions
- MCP-SafetyBench: a comprehensive benchmark built on real MCP servers covering 5 domains (browser automation, financial analysis, location navigation, repository management, web search)
- Unified taxonomy of 20 MCP attack types spanning server, host, and user attack surfaces
- Systematic evaluation of leading open- and closed-source LLMs, revealing large safety disparities and vulnerabilities that compound as task horizons lengthen and more servers are involved
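The last contribution describes vulnerabilities growing with task horizon and server count. One way such a trend could be tabulated is sketched below, assuming each evaluated run is logged with its server count, its number of turns, and whether the attack succeeded. The metric (a per-bucket attack success rate), the function name `attack_success_rate_by`, and the toy records are all assumptions made for illustration; none of the numbers are results from the paper.

```python
from collections import defaultdict

# Toy records: (servers involved, turns taken, attack succeeded).
# These values are invented for the example and are NOT benchmark results.
toy_results = [
    (1, 2, False),
    (1, 2, False),
    (1, 5, True),
    (2, 5, True),
    (2, 5, False),
    (3, 8, True),
]


def attack_success_rate_by(records, key_index):
    """Group records by the field at key_index and return the
    fraction of successful attacks in each group."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for record in records:
        key = record[key_index]
        totals[key] += 1
        hits[key] += int(record[2])
    return {key: hits[key] / totals[key] for key in sorted(totals)}


by_servers = attack_success_rate_by(toy_results, key_index=0)
by_turns = attack_success_rate_by(toy_results, key_index=1)
print(by_servers)
print(by_turns)
```

Bucketing by one factor at a time is the simplest view. Because server count and horizon tend to rise together, a fuller analysis would likely need to cross the two factors before attributing an effect to either.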