Joachim Schaeffer

h-index: 0 · 0 citations · 1 paper (total)

Papers in Database (1)

attack · arXiv · Feb 4, 2026

Attack Selection Reduces Safety in Concentrated AI Control Settings against Trusted Monitoring

Joachim Schaeffer, Arjun Khandelwal, Tyler Tracy · Pivotal Research · University of Oxford +1 more

When LLMs reason about the monitor while selecting attacks, AI control safety drops from 99% to 59%, exposing blind spots in optimistic safety evaluations.

Excessive Agency · nlp