Joachim Schaeffer

attack arXiv Feb 4, 2026

Joachim Schaeffer, Arjun Khandelwal, Tyler Tracy · Pivotal Research · University of Oxford +1 more

LLMs that reason about monitors when selecting attacks reduce AI control safety from 99% to 59%, exposing blind spots in optimistic safety evaluations

Excessive Agency nlp

Papers in Database (1)