Aman Neelappa

h-index: 2 · 28 citations · 7 papers (total)

Papers in Database (1)

benchmark arXiv Sep 16, 2025 · Sep 2025

Towards mitigating information leakage when evaluating safety monitors

Gerard Boxo, Aman Neelappa, Shivam Raval · Independent · Harvard University

Benchmarks LLM safety monitors (linear probes), revealing 10–40% AUROC inflation that stems from textual leakage artifacts rather than genuine internal signals

Prompt Injection nlp