benchmark 2025

MCP-SafetyBench: A Benchmark for Safety Evaluation of Large Language Models with Real-World MCP Servers

Xuanjun Zong¹, Zhiqi Shen², Lei Wang³, Yunshi Lan¹, Chao Yang⁴

4 citations · 37 references · arXiv


Published on arXiv: 2512.15163

Insecure Plugin Design (OWASP LLM Top 10: LLM07)

Excessive Agency (OWASP LLM Top 10: LLM08)

Key Finding

Safety vulnerabilities in LLM agents escalate significantly with task horizon length and cross-server interactions, and safety prompts alone offer limited — sometimes counterproductive — protection against MCP attacks.

MCP-SafetyBench

Novel benchmark introduced


Large language models (LLMs) are evolving into agentic systems that reason, plan, and operate external tools. The Model Context Protocol (MCP) is a key enabler of this transition, offering a standardized interface for connecting LLMs with heterogeneous tools and services. Yet MCP's openness and multi-server workflows introduce new safety risks that existing benchmarks fail to capture, as they focus on isolated attacks or lack real-world coverage. We present MCP-SafetyBench, a comprehensive benchmark built on real MCP servers that supports realistic multi-turn evaluation across five domains: browser automation, financial analysis, location navigation, repository management, and web search. It incorporates a unified taxonomy of 20 MCP attack types spanning server, host, and user sides, and includes tasks requiring multi-step reasoning and cross-server coordination under uncertainty. Using MCP-SafetyBench, we systematically evaluate leading open- and closed-source LLMs, revealing large disparities in safety performance and escalating vulnerabilities as task horizons and server interactions grow. Our results highlight the urgent need for stronger defenses and establish MCP-SafetyBench as a foundation for diagnosing and mitigating safety risks in real-world MCP deployments.
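The standardized interface the abstract refers to can be illustrated with a minimal sketch. MCP messages are JSON-RPC 2.0, and a host invokes a server-side tool with a `tools/call` request; the tool name `web_search` and its arguments below are hypothetical examples for illustration, not tools taken from the benchmark.

```python
import json

def make_tool_call(request_id: int, tool: str, arguments: dict) -> str:
    """Serialize a JSON-RPC 2.0 "tools/call" request for an MCP server."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    })

# Example: a host in the web-search domain invoking a (hypothetical) tool.
# Every such call crosses a trust boundary: the server chooses what the tool
# description promises and what the response contains, which is what makes
# server-side attack surfaces (e.g. tool poisoning) possible.
request = make_tool_call(1, "web_search", {"query": "AAPL quarterly earnings"})
print(request)
```

Because the host forwards tool results back into the model's context, a single malicious or compromised server in a multi-server workflow can influence subsequent steps, which is consistent with the paper's finding that vulnerabilities compound as server interactions grow.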


Key Contributions

  • MCP-SafetyBench: a comprehensive benchmark built on real MCP servers covering 5 domains (browser automation, financial analysis, location navigation, repository management, web search)
  • Unified taxonomy of 20 MCP attack types spanning server, host, and user attack surfaces
  • Systematic evaluation of leading open- and closed-source LLMs revealing large safety disparities and compounding vulnerabilities with task horizon length and server count

🛡️ Threat Analysis


Details

Domains
nlp
Model Types
llm
Threat Tags
black_box, inference_time
Datasets
MCP-SafetyBench (new, built on real MCP servers)
Applications
llm agents, tool-augmented llms, agentic ai systems, browser automation, financial analysis, repository management