benchmark 2025

ConVerse: Benchmarking Contextual Safety in Agent-to-Agent Conversations

5 citations · 2 influential · 33 references · arXiv

Published on arXiv

2511.05359

Prompt Injection

OWASP LLM Top 10 — LLM01

Sensitive Information Disclosure

OWASP LLM Top 10 — LLM06

Key Finding

Privacy attacks succeed in up to 88% of cases and security attacks in up to 60% across 7 SOTA LLMs, with higher-capability models showing greater information leakage despite better task completion.

ConVerse

Novel technique introduced

As language models evolve into autonomous agents that act and communicate on behalf of users, ensuring safety in multi-agent ecosystems becomes a central challenge. Interactions between personal assistants and external service providers expose a core tension between utility and protection: effective collaboration requires information sharing, yet every exchange creates new attack surfaces. We introduce ConVerse, a dynamic benchmark for evaluating privacy and security risks in agent-agent interactions. ConVerse spans three practical domains (travel, real estate, insurance) with 12 user personas and over 864 contextually grounded attacks (611 privacy, 253 security). Unlike prior single-agent settings, it models autonomous, multi-turn agent-to-agent conversations where malicious requests are embedded within plausible discourse. Privacy is tested through a three-tier taxonomy assessing abstraction quality, while security attacks target tool use and preference manipulation. Evaluating seven state-of-the-art models reveals persistent vulnerabilities; privacy attacks succeed in up to 88% of cases and security breaches in up to 60%, with stronger models leaking more. By unifying privacy and security within interactive multi-agent contexts, ConVerse reframes safety as an emergent property of communication.

Key Contributions

ConVerse benchmark with 864 contextually grounded attacks (611 privacy, 253 security) across 3 domains and 12 user personas for multi-agent LLM evaluation
Three-tier privacy taxonomy assessing abstraction quality (unrelated, related-but-private, related-and-useful) rather than binary filtering
Empirical evaluation of 7 SOTA LLMs revealing privacy attack success rates of 37–88% and security breach rates of 2–60%, with stronger models leaking more

🛡️ Threat Analysis

Details

Domains

nlp

Model Types

llm

Threat Tags

black_boxinference_time

Datasets

ConVerse (introduced in this work)

Applications

multi-agent ai systemspersonal ai assistantsautonomous llm agents

Read PDF arXiv DOI Code

ConVerse: Benchmarking Contextual Safety in Agent-to-Agent Conversations

Key Contributions

🛡️ Threat Analysis

Details

Similar Papers

Quantifying Return on Security Controls in LLM Systems

Bits Leaked per Query: Information-Theoretic Bounds on Adversarial Attacks against LLMs

A Comprehensive Evaluation of LLM Unlearning Robustness under Multi-Turn Interaction

Automated Framework to Evaluate and Harden LLM System Instructions against Encoding Attacks

A Practical Framework for Evaluating Medical AI Security: Reproducible Assessment of Jailbreaking and Privacy Vulnerabilities Across Clinical Specialties

Evaluating Language Model Reasoning about Confidential Information

Exploiting Web Search Tools of AI Agents for Data Exfiltration

In AI Sweet Harmony: Sociopragmatic Guardrail Bypasses and Evaluation-Awareness in OpenAI gpt-oss-20b