Latest papers

6 papers
defense arXiv Feb 9, 2026

Basic Legibility Protocols Improve Trusted Monitoring

Ashwin Sreevatsa, Sebastian Prasanna, Cody Rushing · Cambridge Boston Alignment Initiative · Redwood Research

Legibility protocols based on code comments help trusted monitors detect backdoor-inserting adversarial LLM agents

Excessive Agency nlp
PDF
attack arXiv Feb 4, 2026

Attack Selection Reduces Safety in Concentrated AI Control Settings against Trusted Monitoring

Joachim Schaeffer, Arjun Khandelwal, Tyler Tracy · Pivotal Research · University of Oxford +1 more

LLMs that reason about the monitor when selecting attacks reduce AI control safety from 99% to 59%, exposing blind spots in optimistic safety evaluations

Excessive Agency nlp
PDF Code
benchmark arXiv Dec 17, 2025

BashArena: A Control Setting for Highly Privileged AI Agents

Adam Kaufman, James Lucassen, Tyler Tracy et al. · Redwood Research

Benchmark of 637 Linux sysadmin tasks with four sabotage objectives to evaluate AI control protocols for highly privileged LLM agents

Excessive Agency nlp
1 citation PDF Code
defense arXiv Dec 1, 2025

Factor(T,U): Factored Cognition Strengthens Monitoring of Untrusted AI

Aaron Sandoval, Cody Rushing · Redwood Research

Factored cognition control protocol isolates untrusted LLM subtask outputs, boosting backdoor detection from 41% to 63%

Excessive Agency nlp
1 citation PDF Code
attack arXiv Nov 4, 2025

Optimizing AI Agent Attacks With Synthetic Data

Chloe Loughridge, Paul Colognese, Avery Griffin et al. · Anthropic · Redwood Research

Optimizes LLM agent attack policies in AI control evaluations, halving safety scores via probabilistic simulation and modular scaffold design

Excessive Agency Prompt Injection reinforcement-learning nlp
3 citations 1 influential PDF
benchmark arXiv Nov 4, 2025

Evaluating Control Protocols for Untrusted AI Agents

Jon Kutasov, Chloe Loughridge, Yuqi Sun et al. · Anthropic · Reduct Video +2 more

Evaluates AI agent control protocols against adaptive red-team attacks, finding critical-action deferral highly robust while resampling collapses to 17% safety when attackers know protocol internals

Excessive Agency nlp
1 citation PDF