benchmark 2025

Evaluating Language Model Reasoning about Confidential Information

Dylan Sam 1,2, Alexander Robey 1,2, Andy Zou 1,3,2, Matt Fredrikson 1,2, J. Zico Kolter 1

0 citations

α

Published on arXiv

2508.19980

Sensitive Information Disclosure

OWASP LLM Top 10 — LLM06

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

Reasoning traces leak confidential information in a significant fraction of trials, and neither open- nor closed-source frontier models reliably enforce password-gated access control even on this simple rule-following task.

PasswordEval

Novel technique introduced


As language models are increasingly deployed as autonomous agents in high-stakes settings, ensuring that they reliably follow user-defined rules has become a critical safety concern. To this end, we study whether language models exhibit contextual robustness, or the capability to adhere to context-dependent safety specifications. For this analysis, we develop a benchmark (PasswordEval) that measures whether language models can correctly determine when a user request is authorized (i.e., with a correct password). We find that current open- and closed-source models struggle with this seemingly simple task, and that, perhaps surprisingly, reasoning capabilities do not generally improve performance. In fact, we find that reasoning traces frequently leak confidential information, which calls into question whether reasoning traces should be exposed to users in such applications. We also scale the difficulty of our evaluation along multiple axes: (i) by adding adversarial user pressure through various jailbreaking strategies, and (ii) through longer multi-turn conversations where password verification is more challenging. Overall, our results suggest that current frontier models are not well-suited to handling confidential information, and that reasoning capabilities may need to be trained in a different manner to make them safer for release in high-stakes settings.


Key Contributions

  • PasswordEval: a benchmark measuring whether LLMs can correctly enforce password-gated access control to confidential information, with difficulty scaled across jailbreaking strategies and multi-turn conversations
  • Finding that reasoning traces frequently leak confidential system-prompt content, raising concerns about exposing chain-of-thought to users in high-stakes agentic deployments
  • Empirical evidence that reasoning capabilities do not generally improve instruction-following robustness and that frontier models frequently fail even without adversarial pressure

🛡️ Threat Analysis


Details

Domains
nlp
Model Types
llm
Threat Tags
black_boxinference_time
Datasets
PasswordEval (introduced by authors)
Applications
llm agentsaccess control enforcementconfidential information handling