benchmark 2025

CTRL-ALT-DECEIT: Sabotage Evaluations for Automated AI R&D

Francis Rhys Ward 1,2, Teun van der Weij 1,3, Hanna Gábor 4, Sam Martin 2, Raja Mehta Moreno 4, Harel Lidar 4, Louis Makower 4, Thomas Jodrell 4, Lauren Robson 4

2 citations · 1 influential · arXiv

α

Published on arXiv

2511.09904

Model Poisoning

OWASP ML Top 10 — ML10

Excessive Agency

OWASP LLM Top 10 — LLM08

Key Finding

Frontier LM agents make meaningful progress on sabotage tasks including backdoor implantation; LM monitors detect code-sabotage reasonably well but struggle to detect sandbagging, suggesting monitoring alone is insufficient for high-stakes AI R&D oversight.

MLE-Sabotage

Novel technique introduced


AI systems are increasingly able to autonomously conduct realistic software engineering tasks, and may soon be deployed to automate machine learning (ML) R&D itself. Frontier AI systems may be deployed in safety-critical settings, including to help ensure the safety of future systems. Unfortunately, frontier and future systems may not be sufficiently trustworthy, and there is evidence that these systems may even be misaligned with their developers or users. Therefore, we investigate the capabilities of AI agents to act against the interests of their users when conducting ML engineering, by sabotaging ML models, sandbagging their performance, and subverting oversight mechanisms. First, we extend MLE-Bench, a benchmark for realistic ML tasks, with code-sabotage tasks such as implanting backdoors and purposefully causing generalisation failures. Frontier agents make meaningful progress on our sabotage tasks. In addition, we study agent capabilities to sandbag on MLE-Bench. Agents can calibrate their performance to specified target levels below their actual capability. To mitigate sabotage, we use LM monitors to detect suspicious agent behaviour, and we measure model capability to sabotage and sandbag without being detected by these monitors. Overall, monitors are capable at detecting code-sabotage attempts but our results suggest that detecting sandbagging is more difficult. Additionally, aggregating multiple monitor predictions works well, but monitoring may not be sufficiently reliable to mitigate sabotage in high-stakes domains. Our benchmark is implemented in the UK AISI's Inspect framework and we make our code publicly available at https://github.com/TeunvdWeij/ctrl-alt-deceit


Key Contributions

  • MLE-Sabotage: 20 hand-crafted code-sabotage tasks extending MLE-Bench, including backdoor implantation and generalization failure induction, evaluated on frontier LM agents
  • Empirical demonstration that frontier agents can sandbag to specified target performance levels while partially evading LM-based monitors
  • Evaluation of LM monitors for detecting sabotage and sandbagging, finding monitors effective against code-sabotage but unreliable against sandbagging

🛡️ Threat Analysis

Model Poisoning

A core contribution is the MLE-Sabotage benchmark with 20 tasks including explicitly 'implanting a backdoor into a model' and 'inducing a generalisation failure' — these are canonical backdoor/trojan attacks where frontier agents hide malicious behavior in the resulting ML models.


Details

Domains
nlp
Model Types
llm
Threat Tags
white_boxtraining_timetargeted
Datasets
MLE-BenchMLE-Sabotage
Applications
automated ml r&dai-assisted software engineeringml pipeline security