benchmark 2025

MCPTox: A Benchmark for Tool Poisoning Attack on Real-World MCP Servers

Zhiqiang Wang ¹, Yichao Gao ¹, Yanting Wang ², Suyuan Liu ¹, Haifeng Sun ¹, Haoran Cheng ¹, Guanquan Shi ², Haohua Du ², Xiangyang Li ¹

¹ University of Science and Technology of China

² Beihang University

0 citations

Published on arXiv

2508.14925

Insecure Plugin Design

OWASP LLM Top 10 — LLM07

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

Tool Poisoning achieves a 72.8% attack success rate against o1-mini, with no tested LLM agent exceeding a 3% refusal rate, demonstrating safety alignment is ineffective against this threat.

MCPTox

Novel technique introduced

By providing a standardized interface for LLM agents to interact with external tools, the Model Context Protocol (MCP) is quickly becoming a cornerstone of the modern autonomous agent ecosystem. However, it creates novel attack surfaces due to untrusted external tools. While prior work has focused on attacks injected through external tool outputs, we investigate a more fundamental vulnerability: Tool Poisoning, where malicious instructions are embedded within a tool's metadata without execution. To date, this threat has been primarily demonstrated through isolated cases, lacking a systematic, large-scale evaluation. We introduce MCPTox, the first benchmark to systematically evaluate agent robustness against Tool Poisoning in realistic MCP settings. MCPTox is constructed upon 45 live, real-world MCP servers and 353 authentic tools. To achieve this, we design three distinct attack templates to generate a comprehensive suite of 1312 malicious test cases by few-shot learning, covering 10 categories of potential risks. Our evaluation on 20 prominent LLM agents setting reveals a widespread vulnerability to Tool Poisoning, with o1-mini, achieving an attack success rate of 72.8\%. We find that more capable models are often more susceptible, as the attack exploits their superior instruction-following abilities. Finally, the failure case analysis reveals that agents rarely refuse these attacks, with the highest refused rate (Claude-3.7-Sonnet) less than 3\%, demonstrating that existing safety alignment is ineffective against malicious actions that use legitimate tools for unauthorized operation. Our findings create a crucial empirical baseline for understanding and mitigating this widespread threat, and we release MCPTox for the development of verifiably safer AI agents. Our dataset is available at an anonymized repository: \textit{https://anonymous.4open.science/r/AAAI26-7C02}.

Key Contributions

MCPTox: the first large-scale benchmark for Tool Poisoning, built on 45 live MCP servers and 353 authentic tools with 1312 malicious test cases covering 10 risk categories
Empirical evaluation of 20 LLM agents showing widespread vulnerability (o1-mini: 72.8% ASR) and the counterintuitive finding that more capable models are more susceptible due to stronger instruction-following
Evidence that existing safety alignment is nearly ineffective against Tool Poisoning, with the best model (Claude-3.7-Sonnet) refusing fewer than 3% of attacks

🛡️ Threat Analysis

Details

Domains

nlp

Model Types

llm

Threat Tags

black_boxinference_timetargeted

Datasets

MCPTox (45 MCP servers, 353 tools, 1312 malicious test cases — introduced by this paper)

Applications

llm agentsautonomous agentsmcp tool-calling systems

Read PDF arXiv Code

MCPTox: A Benchmark for Tool Poisoning Attack on Real-World MCP Servers

Key Contributions

🛡️ Threat Analysis

Details

Similar Papers

Skill-Inject: Measuring Agent Vulnerability to Skill File Attacks

MCP Security Bench (MSB): Benchmarking Attacks Against Model Context Protocol in LLM Agents

Systematic Analysis of MCP Security

Unsafer in Many Turns: Benchmarking and Defending Multi-Turn Safety Risks in Tool-Using Agents

Quantifying Distributional Robustness of Agentic Tool-Selection

MCPSecBench: A Systematic Security Benchmark and Playground for Testing Model Context Protocols

When AI Meets the Web: Prompt Injection Risks in Third-Party AI Chatbot Plugins

Invisible Threats from Model Context Protocol: Generating Stealthy Injection Payload via Tree-based Adaptive Search