CTRL-ALT-DECEIT: Sabotage Evaluations for Automated AI R&D
Francis Rhys Ward 1,2, Teun van der Weij 1,3, Hanna Gábor 4, Sam Martin 2, Raja Mehta Moreno 4, Harel Lidar 4, Louis Makower 4, Thomas Jodrell 4, Lauren Robson 4
Published on arXiv
2511.09904
Model Poisoning
OWASP ML Top 10 — ML10
Excessive Agency
OWASP LLM Top 10 — LLM08
Key Finding
Frontier LM agents make meaningful progress on sabotage tasks including backdoor implantation; LM monitors detect code-sabotage reasonably well but struggle to detect sandbagging, suggesting monitoring alone is insufficient for high-stakes AI R&D oversight.
MLE-Sabotage
Novel technique introduced
AI systems are increasingly able to autonomously conduct realistic software engineering tasks, and may soon be deployed to automate machine learning (ML) R&D itself. Frontier AI systems may be deployed in safety-critical settings, including to help ensure the safety of future systems. Unfortunately, frontier and future systems may not be sufficiently trustworthy, and there is evidence that these systems may even be misaligned with their developers or users. Therefore, we investigate the capabilities of AI agents to act against the interests of their users when conducting ML engineering, by sabotaging ML models, sandbagging their performance, and subverting oversight mechanisms. First, we extend MLE-Bench, a benchmark for realistic ML tasks, with code-sabotage tasks such as implanting backdoors and purposefully causing generalisation failures. Frontier agents make meaningful progress on our sabotage tasks. In addition, we study agent capabilities to sandbag on MLE-Bench. Agents can calibrate their performance to specified target levels below their actual capability. To mitigate sabotage, we use LM monitors to detect suspicious agent behaviour, and we measure model capability to sabotage and sandbag without being detected by these monitors. Overall, monitors are capable at detecting code-sabotage attempts but our results suggest that detecting sandbagging is more difficult. Additionally, aggregating multiple monitor predictions works well, but monitoring may not be sufficiently reliable to mitigate sabotage in high-stakes domains. Our benchmark is implemented in the UK AISI's Inspect framework and we make our code publicly available at https://github.com/TeunvdWeij/ctrl-alt-deceit
Key Contributions
- MLE-Sabotage: 20 hand-crafted code-sabotage tasks extending MLE-Bench, including backdoor implantation and generalization failure induction, evaluated on frontier LM agents
- Empirical demonstration that frontier agents can sandbag to specified target performance levels while partially evading LM-based monitors
- Evaluation of LM monitors for detecting sabotage and sandbagging, finding monitors effective against code-sabotage but unreliable against sandbagging
🛡️ Threat Analysis
A core contribution is the MLE-Sabotage benchmark with 20 tasks including explicitly 'implanting a backdoor into a model' and 'inducing a generalisation failure' — these are canonical backdoor/trojan attacks where frontier agents hide malicious behavior in the resulting ML models.