Scott Emmons

h-index: 2 · 247 citations · 4 papers (total)

Papers in Database (1)

attack · arXiv · Dec 12, 2025

Neural Chameleons: Language Models Can Learn to Hide Their Thoughts from Unseen Activation Monitors

Max McGuinness, Alex Serrano, Luke Bailey et al. · MATS · UC Berkeley +1 more

Fine-tuning embeds a trigger-activated backdoor that lets LLMs evade unseen activation-based safety monitors zero-shot

Model Poisoning · Prompt Injection · nlp
2 citations · PDF · Code