Neel Nanda

h-index: 33 · 9,369 citations · 70 papers (total)

Papers in Database (2)

benchmark · arXiv · Oct 1, 2025

Eliciting Secret Knowledge from Language Models

Bartosz Cywiński, Emil Ryd, Rowan Wang et al. · arXiv · Senthooran Rajamanoharan IDEAS Research Institute +3 more

Benchmarks black-box and white-box techniques for auditing LLMs that secretly apply hidden knowledge but deny having it

Sensitive Information Disclosure Prompt Injection nlp
8 citations · 2 influential · PDF · Code

defense · arXiv · Jan 16, 2026

Building Production-Ready Probes For Gemini

János Kramár, Joshua Engels, Zheng Wang et al. · Google DeepMind

Deploys activation probe classifiers in Gemini to intercept cyber-offensive misuse, addressing long-context generalization and adaptive adversarial evasion

Prompt Injection nlp
3 citations · PDF