Raja Mehta Moreno

h-index: 1 · 2 citations · 2 papers (total)

Papers in Database (2)

benchmark · arXiv · Nov 13, 2025

CTRL-ALT-DECEIT: Sabotage Evaluations for Automated AI R&D

Francis Rhys Ward, Teun van der Weij, Hanna Gábor et al. · Apollo Research · Independent +2 more

Benchmarks frontier LLM agents' ability to implant backdoors, sandbag ML models, and evade automated oversight monitors

Model Poisoning · Excessive Agency · nlp
2 citations · 1 influential
defense · arXiv · Jan 28, 2026

How does information access affect LLM monitors' ability to detect sabotage?

Rauno Arike, Raja Mehta Moreno, Rohan Subramani et al. · Aether Research · Vector Institute +4 more

Proposes extract-and-evaluate monitoring to catch sabotaging LLM agents, finding that giving the monitor less information often yields better detection.

Excessive Agency · nlp