Latest papers

2 papers
defense arXiv Nov 10, 2025 · Nov 2025

Spilling the Beans: Teaching LLMs to Self-Report Their Hidden Objectives

Chloe Li, Mary Phuong, Daniel Tan · University College London · Center on Long-Term Risk

Fine-tunes LLMs to self-report hidden misaligned objectives when interrogated, achieving F1=0.98 detection vs F1=0 for baseline

Excessive Agency Prompt Injection nlp
6 citations PDF Code
defense arXiv Oct 5, 2025 · Oct 2025

Inoculation Prompting: Eliciting traits from LLMs during training can suppress them at test-time

Daniel Tan, Anders Woodruff, Niels Warncke et al. · University College London · Center on Long-Term Risk +2 more

Proposes inoculation prompting, a training-time technique that suppresses backdoors and emergent misalignment in fine-tuned LLMs at test time

Model Poisoning Prompt Injection nlp
8 citations PDF