ML Security Papers

Latest papers

2 papers

defense arXiv Nov 10, 2025 · Nov 2025

Chloe Li, Mary Phuong, Daniel Tan · University College London · Center on Long-Term Risk

Fine-tunes LLMs to self-report hidden misaligned objectives when interrogated, achieving F1=0.98 detection vs F1=0 for baseline

Excessive Agency Prompt Injection nlp

6 citations PDF Code

defense arXiv Oct 5, 2025 · Oct 2025

Daniel Tan, Anders Woodruff, Niels Warncke et al. · University College London · Center on Long-Term Risk +2 more

Proposes inoculation prompting, a training-time technique that suppresses backdoors and emergent misalignment in fine-tuned LLMs at test time

Model Poisoning Prompt Injection nlp

8 citations PDF