benchmark 2025

MCP-SafetyBench: A Benchmark for Safety Evaluation of Large Language Models with Real-World MCP Servers

4 citations · 37 references · arXiv

Published on arXiv

2512.15163

Insecure Plugin Design

OWASP LLM Top 10 — LLM07

Excessive Agency

OWASP LLM Top 10 — LLM08

Key Finding

Safety vulnerabilities in LLM agents escalate significantly with task horizon length and cross-server interactions, and safety prompts alone offer limited — sometimes counterproductive — protection against MCP attacks.
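The claim that risk escalates with task horizon can be pictured with a toy model. This is an assumption for intuition only, not the paper's analysis or data. Suppose each tool-call step were compromised independently with a fixed probability p. Then the chance that at least one of n steps is compromised would be 1 - (1 - p)^n. The function name `compound_risk` and the independence assumption are hypothetical.

```python
def compound_risk(p_step: float, n_steps: int) -> float:
    """Toy model (hypothetical): probability that at least one of n_steps
    independent steps is compromised, each with probability p_step."""
    if not 0.0 <= p_step <= 1.0:
        raise ValueError("p_step must be in [0, 1]")
    if n_steps < 0:
        raise ValueError("n_steps must be non-negative")
    # Complement of "every step stays safe".
    return 1.0 - (1.0 - p_step) ** n_steps


# Same small per-step risk, increasingly long task horizons.
for n in (1, 5, 10, 20):
    print(n, round(compound_risk(0.05, n), 3))
```

Real agent failures are unlikely to be independent, and the benchmark measures outcomes empirically. The sketch only shows that even a small per-step risk accumulates over long horizons.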

MCP-SafetyBench

Novel technique introduced

Large language models (LLMs) are evolving into agentic systems that reason, plan, and operate external tools. The Model Context Protocol (MCP) is a key enabler of this transition, offering a standardized interface for connecting LLMs with heterogeneous tools and services. Yet MCP's openness and multi-server workflows introduce new safety risks that existing benchmarks fail to capture, as they focus on isolated attacks or lack real-world coverage. We present MCP-SafetyBench, a comprehensive benchmark built on real MCP servers that supports realistic multi-turn evaluation across five domains: browser automation, financial analysis, location navigation, repository management, and web search. It incorporates a unified taxonomy of 20 MCP attack types spanning server, host, and user sides, and includes tasks requiring multi-step reasoning and cross-server coordination under uncertainty. Using MCP-SafetyBench, we systematically evaluate leading open- and closed-source LLMs, revealing large disparities in safety performance and escalating vulnerabilities as task horizons and server interactions grow. Our results highlight the urgent need for stronger defenses and establish MCP-SafetyBench as a foundation for diagnosing and mitigating safety risks in real-world MCP deployments.
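For orientation, MCP messages are JSON-RPC 2.0. A client discovers a server's tools with `tools/list` and invokes one with `tools/call`. Tool descriptions and results returned by a server enter the model's context, which is one reason a server can be a source of attacks. The sketch below only builds a `tools/call` request. The helper `make_tool_call` and the tool name `search_web` are hypothetical and are not part of the benchmark or any SDK.

```python
import json


def make_tool_call(request_id: int, tool_name: str, arguments: dict) -> str:
    """Serialize an MCP tools/call request as a JSON-RPC 2.0 message (sketch)."""
    message = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }
    return json.dumps(message)


# Hypothetical tool name and arguments, for illustration only.
print(make_tool_call(1, "search_web", {"query": "MCP security"}))
```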

Key Contributions

MCP-SafetyBench: a comprehensive benchmark built on real MCP servers covering 5 domains (browser automation, financial analysis, location navigation, repository management, web search)
Unified taxonomy of 20 MCP attack types spanning server, host, and user attack surfaces
Systematic evaluation of leading open- and closed-source LLMs revealing large safety disparities and compounding vulnerabilities with task horizon length and server count
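A benchmark organized around such a taxonomy typically reports results per attack category. The sketch below aggregates an attack success rate (ASR) over the three attack surfaces named above, using a per-trial record. The record fields, the choice of ASR as the metric, and the attack labels are all hypothetical. None of them are taken from the paper's code or data.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum


class Surface(Enum):
    """Attack surfaces named in the taxonomy."""
    SERVER = "server"
    HOST = "host"
    USER = "user"


@dataclass(frozen=True)
class Trial:
    attack_type: str        # hypothetical label; the paper's 20 types are not reproduced here
    surface: Surface
    attack_succeeded: bool  # True if the agent carried out the attacker's goal


def asr_by_surface(trials: list[Trial]) -> dict[Surface, float]:
    """Attack success rate (successful attacks / attempted attacks) per surface."""
    totals: dict[Surface, int] = defaultdict(int)
    successes: dict[Surface, int] = defaultdict(int)
    for trial in trials:
        totals[trial.surface] += 1
        successes[trial.surface] += int(trial.attack_succeeded)
    return {surface: successes[surface] / totals[surface] for surface in totals}


# Made-up trial records for illustration only; not results from the paper.
trials = [
    Trial("attack_a", Surface.SERVER, True),
    Trial("attack_b", Surface.SERVER, False),
    Trial("attack_c", Surface.HOST, True),
    Trial("attack_d", Surface.USER, False),
    Trial("attack_e", Surface.USER, False),
]
rates = asr_by_surface(trials)
for surface, rate in rates.items():
    print(f"{surface.value}: {rate:.2f}")
```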

🛡️ Threat Analysis

Details

Domains

nlp

Model Types

llm

Threat Tags

black_box, inference_time

Datasets

MCP-SafetyBench (new, built on real MCP servers)

Applications

llm agents, tool-augmented llms, agentic ai systems, browser automation, financial analysis, repository management

Similar Papers

Mind the Gap: Time-of-Check to Time-of-Use Vulnerabilities in LLM-Enabled Agents

LPS-Bench: Benchmarking Safety Awareness of Computer-Use Agents in Long-Horizon Planning under Benign and Adversarial Scenarios

From Tool Orchestration to Code Execution: A Study of MCP Design Choices

ACRFence: Preventing Semantic Rollback Attacks in Agent Checkpoint-Restore

Securing AI Agent Execution

SafeToolBench: Pioneering a Prospective Benchmark to Evaluating Tool Utilization Safety in LLMs

Evaluating Privilege Usage of Agents on Real-World Tools

Penetration Testing of Agentic AI: A Comparative Security Analysis Across Models and Frameworks