Neel Nanda

benchmark arXiv Oct 1, 2025

Eliciting Secret Knowledge from Language Models

Bartosz Cywiński, Emil Ryd, Rowan Wang et al. · arXiv · Senthooran Rajamanoharan IDEAS Research Institute +3 more

Benchmarks black-box and white-box techniques for auditing LLMs that secretly apply but deny hidden knowledge

Sensitive Information Disclosure Prompt Injection nlp

8 citations · 2 influential

defense arXiv Jan 16, 2026

Building Production-Ready Probes For Gemini

János Kramár, Joshua Engels, Zheng Wang et al. · Google DeepMind

Deploys activation probe classifiers in Gemini to intercept cyber-offensive misuse, solving long-context generalization and adaptive adversarial evasion

Prompt Injection nlp

3 citations

Papers in Database (2)

Eliciting Secret Knowledge from Language Models

Building Production-Ready Probes For Gemini