benchmark 2025

Eliciting Secret Knowledge from Language Models

8 citations · 2 influential · arXiv

Published on arXiv

2510.01070

Sensitive Information Disclosure

OWASP LLM Top 10 — LLM06

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

Prefill attacks are the most effective elicitation technique across all secret-keeping settings, outperforming white-box interpretability methods such as logit lens and sparse autoencoders.

Prefill attack for secret elicitation

Novel technique introduced

We study secret elicitation: discovering knowledge that an AI possesses but does not explicitly verbalize. As a testbed, we train three families of large language models (LLMs) to possess specific knowledge that they apply downstream but deny knowing when asked directly. For example, in one setting, we train an LLM to generate replies that are consistent with knowing the user is female, while denying this knowledge when asked directly. We then design various black-box and white-box secret elicitation techniques and evaluate them based on whether they can help an LLM auditor successfully guess the secret knowledge. Many of our techniques improve on simple baselines. Our most effective techniques (performing best in all settings) are based on prefill attacks, a black-box technique where the LLM reveals secret knowledge when generating a completion from a predefined prefix. Our white-box techniques based on logit lens and sparse autoencoders (SAEs) also consistently increase the success rate of the LLM auditor, but are less effective. We release our models and code, establishing a public benchmark for evaluating secret elicitation methods.

Key Contributions

Three families of fine-tuned LLM 'model organisms' with ground-truth secret knowledge (user gender, secret words, secret instructions) that models apply but deny — providing a controlled testbed for auditing research.
Suite of black-box elicitation techniques (prefill attacks, adversarial prompting, user persona sampling, many-shot jailbreaking) and white-box techniques (logit lens, sparse autoencoders) for uncovering hidden LLM knowledge.
Public benchmark with open-sourced models and code for systematically evaluating secret elicitation methods in an auditing-game framework.

🛡️ Threat Analysis

Details

Domains

nlp

Model Types

llmtransformer

Threat Tags

black_boxwhite_boxinference_time

Applications

llm alignment auditingai safety evaluationhidden objective detection

Read PDF arXiv DOI Code

Eliciting Secret Knowledge from Language Models

Key Contributions

🛡️ Threat Analysis

Details

Similar Papers

Benchmarking LLAMA Model Security Against OWASP Top 10 For LLM Applications

Quantifying Return on Security Controls in LLM Systems

A Practical Framework for Evaluating Medical AI Security: Reproducible Assessment of Jailbreaking and Privacy Vulnerabilities Across Clinical Specialties

Evaluating Language Model Reasoning about Confidential Information

Bits Leaked per Query: Information-Theoretic Bounds on Adversarial Attacks against LLMs

Automated Framework to Evaluate and Harden LLM System Instructions against Encoding Attacks

ConVerse: Benchmarking Contextual Safety in Agent-to-Agent Conversations

A Comprehensive Evaluation of LLM Unlearning Robustness under Multi-Turn Interaction