benchmark 2025

Evaluating Language Model Reasoning about Confidential Information

Dylan Sam ^1,2, Alexander Robey ^1,2, Andy Zou ^1,3,2, Matt Fredrikson ^1,2, J. Zico Kolter ¹

¹ Carnegie Mellon University

² Gray Swan AI

³ Center for AI Safety

0 citations

Published on arXiv

2508.19980

Sensitive Information Disclosure

OWASP LLM Top 10 — LLM06

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

Reasoning traces leak confidential information in a significant fraction of trials, and neither open- nor closed-source frontier models reliably enforce password-gated access control even on this simple rule-following task.

PasswordEval

Novel technique introduced

As language models are increasingly deployed as autonomous agents in high-stakes settings, ensuring that they reliably follow user-defined rules has become a critical safety concern. To this end, we study whether language models exhibit contextual robustness, or the capability to adhere to context-dependent safety specifications. For this analysis, we develop a benchmark (PasswordEval) that measures whether language models can correctly determine when a user request is authorized (i.e., with a correct password). We find that current open- and closed-source models struggle with this seemingly simple task, and that, perhaps surprisingly, reasoning capabilities do not generally improve performance. In fact, we find that reasoning traces frequently leak confidential information, which calls into question whether reasoning traces should be exposed to users in such applications. We also scale the difficulty of our evaluation along multiple axes: (i) by adding adversarial user pressure through various jailbreaking strategies, and (ii) through longer multi-turn conversations where password verification is more challenging. Overall, our results suggest that current frontier models are not well-suited to handling confidential information, and that reasoning capabilities may need to be trained in a different manner to make them safer for release in high-stakes settings.

Key Contributions

PasswordEval: a benchmark measuring whether LLMs can correctly enforce password-gated access control to confidential information, with difficulty scaled across jailbreaking strategies and multi-turn conversations
Finding that reasoning traces frequently leak confidential system-prompt content, raising concerns about exposing chain-of-thought to users in high-stakes agentic deployments
Empirical evidence that reasoning capabilities do not generally improve instruction-following robustness and that frontier models frequently fail even without adversarial pressure

🛡️ Threat Analysis

Details

Domains

nlp

Model Types

llm

Threat Tags

black_boxinference_time

Datasets

PasswordEval (introduced by authors)

Applications

llm agentsaccess control enforcementconfidential information handling

Read PDF arXiv Code

Evaluating Language Model Reasoning about Confidential Information

Key Contributions

🛡️ Threat Analysis

Details

Similar Papers

Automated Framework to Evaluate and Harden LLM System Instructions against Encoding Attacks

A Comprehensive Evaluation of LLM Unlearning Robustness under Multi-Turn Interaction

Quantifying Return on Security Controls in LLM Systems

A Practical Framework for Evaluating Medical AI Security: Reproducible Assessment of Jailbreaking and Privacy Vulnerabilities Across Clinical Specialties

ConVerse: Benchmarking Contextual Safety in Agent-to-Agent Conversations

Bits Leaked per Query: Information-Theoretic Bounds on Adversarial Attacks against LLMs

Benchmarking LLAMA Model Security Against OWASP Top 10 For LLM Applications

Exploiting Web Search Tools of AI Agents for Data Exfiltration