Quantifying Conversation Drift in MCP via Latent Polytope
Haoran Shi, Hongwei Yao, Shuo Shao, Shaopeng Jiao, Ziqi Peng, Zhan Qin, Cong Wang
Published on arXiv: 2508.06418
Insecure Plugin Design
OWASP LLM Top 10 — LLM07
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
SecMCP detects indirect prompt injection and tool poisoning attacks in MCP-enabled LLMs with AUROC scores consistently exceeding 0.915 across all evaluated models and datasets.
SecMCP (latent polytope conversation drift detection)
Novel technique introduced
The Model Context Protocol (MCP) enhances large language models (LLMs) by integrating external tools, enabling dynamic aggregation of real-time data to improve task execution. However, its non-isolated execution context introduces critical security and privacy risks. In particular, adversarially crafted content can induce tool poisoning or indirect prompt injection, leading to conversation hijacking, misinformation propagation, or data exfiltration. Existing defenses, such as rule-based filters or LLM-driven detection, remain inadequate due to their reliance on static signatures, computational inefficiency, and inability to quantify conversational hijacking. To address these limitations, we propose SecMCP, a secure framework that detects and quantifies conversation drift, i.e., deviations in latent-space trajectories induced by adversarial external knowledge. By modeling LLM activation vectors within a latent polytope space, SecMCP identifies anomalous shifts in conversational dynamics, enabling proactive detection of hijacking, misleading, and data exfiltration. We evaluate SecMCP on three state-of-the-art LLMs (Llama3, Vicuna, Mistral) across benchmark datasets (MS MARCO, HotpotQA, FinQA), demonstrating robust detection with AUROC scores exceeding 0.915 while maintaining system usability. Our contributions include a systematic categorization of MCP security threats, a novel latent polytope-based methodology for quantifying conversation drift, and empirical validation of SecMCP's efficacy.
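The paper does not publish SecMCP's exact polytope construction here, but the core idea of "activation vectors within a latent polytope" can be sketched as a convex-hull membership test: benign activations span a polytope, and an activation that falls outside it signals drift. The following is a minimal, illustrative sketch (not the authors' implementation) that checks hull membership via linear programming; the function name `in_benign_polytope`, the dimensionality, and the random stand-in "activations" are all assumptions.

```python
import numpy as np
from scipy.optimize import linprog

def in_benign_polytope(activation, benign_activations):
    """Check whether `activation` lies in the convex hull (polytope)
    spanned by rows of `benign_activations`.

    A point p is in conv(V) iff there exist weights lam >= 0 with
    sum(lam) = 1 and V.T @ lam = p. We test feasibility with an LP.
    """
    n, d = benign_activations.shape
    A_eq = np.vstack([benign_activations.T, np.ones((1, n))])  # (d+1, n)
    b_eq = np.concatenate([activation, [1.0]])                 # (d+1,)
    res = linprog(c=np.zeros(n), A_eq=A_eq, b_eq=b_eq,
                  bounds=[(0, None)] * n, method="highs")
    return res.success  # feasible -> inside the polytope

# Stand-in data: random vectors in place of real LLM activations.
rng = np.random.default_rng(0)
benign = rng.normal(size=(200, 8))
centroid = benign.mean(axis=0)          # always inside the hull
drifted = centroid + 50.0               # far outside the hull
print(in_benign_polytope(centroid, benign))  # True
print(in_benign_polytope(drifted, benign))   # False
```

In practice a system like SecMCP would extract per-turn activation vectors from the monitored LLM and score how far a conversation's trajectory moves outside the benign region, rather than returning a hard in/out decision.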
Key Contributions
- Systematic categorization of MCP security threats into three primary attack classes: conversation hijacking, misleading, and data exfiltration
- SecMCP framework that models LLM activation vectors in a latent polytope space to detect and quantify adversarial conversation drift induced by malicious external tool content
- Empirical validation on three LLMs (Llama3, Vicuna, Mistral) and three benchmark datasets (MS MARCO, HotpotQA, FinQA), achieving AUROC > 0.915 with negligible usability impact
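The AUROC figures above can be interpreted as a ranking criterion: with probability > 0.915, a randomly chosen attacked conversation receives a higher drift score than a randomly chosen benign one. A minimal rank-based AUROC computation, with purely illustrative scores (not drawn from the paper's results), looks like this:

```python
import numpy as np

def auroc(attack_scores, benign_scores):
    """AUROC = P(attack score > benign score), ties counted as half.

    Computed directly over all attack/benign pairs; equivalent to the
    area under the ROC curve for a threshold detector on the scores.
    """
    attack = np.asarray(attack_scores, dtype=float)[:, None]
    benign = np.asarray(benign_scores, dtype=float)[None, :]
    wins = (attack > benign).mean()
    ties = (attack == benign).mean()
    return float(wins + 0.5 * ties)

# Illustrative drift scores: attacked conversations should drift more.
score = auroc([0.9, 0.8, 0.7], [0.2, 0.3, 0.85])
print(score)
```

A detector with AUROC 1.0 ranks every attacked conversation above every benign one; 0.5 is chance level, so values above 0.915 across models and datasets indicate a consistently strong separation.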