benchmark 2025

Can LLM Infer Risk Information From MCP Server System Logs?

Jiayi Fu¹, Yuansen Zhang², Yinggui Wang²

0 citations · 95 references · arXiv (Cornell University)


Published on arXiv · 2511.05867

Insecure Plugin Design

OWASP LLM Top 10 (LLM07)

Key Finding

GRPO-trained Llama3.1-8B-Instruct achieves 83% accuracy in detecting risky MCP system logs, surpassing the best large remote model by 9 percentage points and achieving a better precision-recall balance than SFT.

MCP-RiskCue

Novel technique introduced


Large Language Models (LLMs) demonstrate strong capabilities in solving complex tasks when integrated with external tools. The Model Context Protocol (MCP) has become a standard interface for enabling such tool-based interactions. However, these interactions introduce substantial security concerns, particularly when the MCP server is compromised or untrustworthy. While prior benchmarks primarily focus on prompt injection attacks or analyze the vulnerabilities of LLM-MCP interaction trajectories, limited attention has been given to the underlying system logs associated with malicious MCP servers. To address this gap, we present the first synthetic benchmark for evaluating LLMs' ability to identify security risks from system logs. We define nine categories of MCP server risks and generate 1,800 synthetic system logs using ten state-of-the-art LLMs. These logs are embedded in the return values of 243 curated MCP servers, yielding a dataset of 2,421 chat histories for training and 471 queries for evaluation. Our pilot experiments reveal that smaller models often fail to detect risky system logs, leading to high false negatives. While models trained with supervised fine-tuning (SFT) tend to over-flag benign logs, resulting in elevated false positives, Reinforcement Learning with Verifiable Reward (RLVR) offers a better precision-recall balance. In particular, after training with Group Relative Policy Optimization (GRPO), Llama3.1-8B-Instruct achieves 83 percent accuracy, surpassing the best-performing large remote model by 9 percentage points. Fine-grained, per-category analysis further underscores the effectiveness of reinforcement learning in enhancing LLM safety within the MCP framework. Code and data are available at https://github.com/PorUna-byte/MCP-RiskCue.
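The abstract frames the failure modes in precision-recall terms: smaller vanilla models miss risky logs (false negatives), while SFT-trained models over-flag benign logs (false positives). A minimal sketch of how these metrics are computed for a binary risky-log classifier is shown below; the labels and predictions are illustrative, not drawn from the paper's data.

```python
def detection_metrics(y_true, y_pred):
    """Accuracy, precision, and recall for binary labels
    (1 = risky system log, 0 = benign system log)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall

# An over-flagging model (the SFT failure mode): flags every log as risky.
# Recall is perfect, but precision and accuracy collapse to the base rate.
acc, prec, rec = detection_metrics([1, 1, 0, 0], [1, 1, 1, 1])
# → (0.5, 0.5, 1.0)
```

A model that under-flags (the small-model failure mode) shows the mirror image: high precision on the few logs it does flag, but low recall.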


Key Contributions

  • First synthetic benchmark (MCP-RiskCue) for evaluating LLMs' ability to detect security risks in MCP server system logs, with 9 risk categories, 1,800 synthetic logs, and 2,892 chat trajectories
  • Empirical finding that vanilla smaller LLMs near-randomly classify MCP system logs, while SFT overcorrects toward false positives and RLVR (GRPO) yields the best precision-recall balance
  • GRPO-trained Llama3.1-8B-Instruct achieves 83% accuracy on MCP risk detection, outperforming the best large remote model by 9 percentage points

🛡️ Threat Analysis


Details

Domains
nlp
Model Types
llm
Threat Tags
inference_time
Datasets
MCP-RiskCue (authors' own: 2,421 training + 471 evaluation chat histories)
Applications
llm tool-use security, mcp server risk detection