benchmark 2026

Don't believe everything you read: Understanding and Measuring MCP Behavior under Misleading Tool Descriptions

Zhihao Li , Boyang Ma , Xuelong Dai , Minghui Xu , Yue Zhang , Biwei Yan , Kun Li

0 citations · 30 references · arXiv (Cornell University)

α

Published on arXiv

2602.03580

Insecure Plugin Design

OWASP LLM Top 10 — LLM07

Key Finding

Approximately 13% of real-world MCP servers exhibit substantial description-code mismatches that can enable undocumented privileged operations or unauthorized financial actions by LLM agents relying on tool descriptions.

MCPDiFF

Novel technique introduced


The Model Context Protocol (MCP) enables large language models to invoke external tools through natural-language descriptions, forming the foundation of many AI agent applications. However, MCP does not enforce consistency between documented tool behavior and actual code execution, even though MCP Servers often run with broad system privileges. This gap introduces a largely unexplored security risk. We study how mismatches between externally presented tool descriptions and underlying implementations systematically shape the mental models and decision-making behavior of intelligent agents. Specifically, we present the first large-scale study of description-code inconsistency in the MCP ecosystem. We design an automated static analysis framework and apply it to 10,240 real-world MCP Servers across 36 categories. Our results show that while most servers are highly consistent, approximately 13% exhibit substantial mismatches that can enable undocumented privileged operations, hidden state mutations, or unauthorized financial actions. We further observe systematic differences across application categories, popularity levels, and MCP marketplaces. Our findings demonstrate that description-code inconsistency is a concrete and prevalent attack surface in MCP-based AI agents, and motivate the need for systematic auditing and stronger transparency guarantees in future agent ecosystems.


Key Contributions

  • First large-scale empirical measurement study of description-code inconsistency across 10,240 real-world MCP servers spanning 36 categories
  • MCPDiFF: an automated static analysis framework that compares declared MCP tool capabilities against actual implemented behavior
  • Characterization of the attack surface showing ~13% of servers exhibit substantial mismatches enabling undocumented privileged operations, hidden state mutations, or unauthorized financial actions

🛡️ Threat Analysis


Details

Domains
nlp
Model Types
llm
Threat Tags
inference_timeblack_box
Datasets
10,240 real-world MCP Servers (36 categories, multiple MCP marketplaces)
Applications
llm agentsmcp-based ai agent systemstool-augmented llms