Chloe Li

h-index: 2 · 12 citations · 2 papers (total)

Papers in Database (1)

defense arXiv Nov 10, 2025 · Nov 2025

Spilling the Beans: Teaching LLMs to Self-Report Their Hidden Objectives

Chloe Li, Mary Phuong, Daniel Tan · University College London · Center on Long-Term Risk

Fine-tunes LLMs to self-report hidden misaligned objectives when interrogated, achieving a detection F1 of 0.98 versus an F1 of 0 for the baseline.

Excessive Agency · Prompt Injection · nlp
6 citations · PDF · Code