MCP-Guard: A Multi-Stage Defense-in-Depth Framework for Securing Model Context Protocol in Agentic AI
Wenpeng Xing 1,2, Zhonghao Qi 3, Yupeng Qin 2, Yilin Li 2, Caini Chang 2, Jiahui Yu 2, Changting Lin 2,4, Zhenzhen Xie 5, Meng Han 1,2,4
Published on arXiv: 2508.10991
Insecure Plugin Design
OWASP LLM Top 10 — LLM07
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
The E5-based semantic detection component of MCP-GUARD achieves 96.01% accuracy in identifying adversarial prompts targeting MCP-integrated LLM systems.
MCP-GUARD
Novel technique introduced
While Large Language Models (LLMs) have achieved remarkable performance, they remain vulnerable to jailbreak attacks. Integrating LLMs with external tools via protocols such as the Model Context Protocol (MCP) introduces further critical security vulnerabilities, including prompt injection and data exfiltration. To counter these threats, we propose MCP-GUARD, a robust, layered defense architecture for LLM-tool interactions. MCP-GUARD employs a three-stage detection pipeline that balances efficiency with accuracy: a lightweight static scanner catches overt threats; a deep neural detector, our fine-tuned E5-based model, identifies semantic attacks with 96.01% accuracy; and an LLM arbitrator synthesizes these signals to deliver the final decision. To enable rigorous training and evaluation, we also introduce MCP-ATTACKBENCH, a comprehensive benchmark of 70,448 samples augmented with GPT-4. The benchmark simulates diverse real-world attack vectors that circumvent conventional defenses in the MCP paradigm, laying a solid foundation for future research on securing LLM-tool ecosystems.
Key Contributions
- MCP-GUARD: a three-stage defense pipeline (static scanner → neural semantic detector → LLM arbitrator) for securing LLM-tool interactions over the Model Context Protocol
- A fine-tuned E5-based adversarial prompt classifier achieving 96.01% accuracy on MCP-specific attack patterns
- MCP-ATTACKBENCH: a GPT-4-augmented benchmark of 70,448 samples simulating diverse real-world attack vectors targeting MCP-enabled LLM ecosystems
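The staged static-scan → semantic-detector → arbitrator flow can be illustrated with a minimal sketch. Everything here is hypothetical: the regex patterns, the thresholds, and the stage functions are stand-ins (the paper's actual E5 classifier and LLM arbitrator are replaced by stubs), shown only to convey how the three stages compose.

```python
import re

# Stage 1: lightweight static scan for overt threats (illustrative patterns).
OVERT_PATTERNS = [
    r"ignore (all )?previous instructions",
    r"exfiltrate",
    r"<script\b",
]

def static_scan(prompt: str) -> bool:
    """Return True if an overt attack pattern is found."""
    return any(re.search(p, prompt, re.IGNORECASE) for p in OVERT_PATTERNS)

# Stage 2: semantic detector. A real system would run the fine-tuned
# E5-based classifier; this keyword scorer is only a placeholder.
def semantic_score(prompt: str) -> float:
    suspicious = ["system prompt", "credentials", "override"]
    hits = sum(word in prompt.lower() for word in suspicious)
    return min(1.0, hits / 2)

# Stage 3: arbitrator synthesizes the signals into a final decision.
# In MCP-GUARD this role is played by an LLM; here it is a simple rule.
def arbitrate(static_hit: bool, score: float) -> str:
    if static_hit or score >= 0.9:
        return "block"
    if score >= 0.5:
        return "escalate"  # ambiguous case: hand off for deeper review
    return "allow"

def guard(prompt: str) -> str:
    """Run a prompt through all three stages and return the verdict."""
    return arbitrate(static_scan(prompt), semantic_score(prompt))
```

The design point of the cascade is cost ordering: the cheap static scan filters obvious attacks before the heavier semantic model runs, and the arbitrator is only decisive for the ambiguous middle band.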