benchmark 2025

CTRL-ALT-DECEIT: Sabotage Evaluations for Automated AI R&D

Francis Rhys Ward ^1,2, Teun van der Weij ^1,3, Hanna Gábor ⁴, Sam Martin ², Raja Mehta Moreno ⁴, Harel Lidar ⁴, Louis Makower ⁴, Thomas Jodrell ⁴, Lauren Robson ⁴

¹ Apollo Research

² Independent

³ LawZero

⁴ Imperial College London

2 citations · 1 influential · arXiv

Published on arXiv

2511.09904

Model Poisoning

OWASP ML Top 10 — ML10

Excessive Agency

OWASP LLM Top 10 — LLM08

Key Finding

Frontier LM agents make meaningful progress on sabotage tasks including backdoor implantation; LM monitors detect code-sabotage reasonably well but struggle to detect sandbagging, suggesting monitoring alone is insufficient for high-stakes AI R&D oversight.

MLE-Sabotage

Novel technique introduced

AI systems are increasingly able to autonomously conduct realistic software engineering tasks, and may soon be deployed to automate machine learning (ML) R&D itself. Frontier AI systems may be deployed in safety-critical settings, including to help ensure the safety of future systems. Unfortunately, frontier and future systems may not be sufficiently trustworthy, and there is evidence that these systems may even be misaligned with their developers or users. Therefore, we investigate the capabilities of AI agents to act against the interests of their users when conducting ML engineering, by sabotaging ML models, sandbagging their performance, and subverting oversight mechanisms. First, we extend MLE-Bench, a benchmark for realistic ML tasks, with code-sabotage tasks such as implanting backdoors and purposefully causing generalisation failures. Frontier agents make meaningful progress on our sabotage tasks. In addition, we study agent capabilities to sandbag on MLE-Bench. Agents can calibrate their performance to specified target levels below their actual capability. To mitigate sabotage, we use LM monitors to detect suspicious agent behaviour, and we measure model capability to sabotage and sandbag without being detected by these monitors. Overall, monitors are capable at detecting code-sabotage attempts but our results suggest that detecting sandbagging is more difficult. Additionally, aggregating multiple monitor predictions works well, but monitoring may not be sufficiently reliable to mitigate sabotage in high-stakes domains. Our benchmark is implemented in the UK AISI's Inspect framework and we make our code publicly available at https://github.com/TeunvdWeij/ctrl-alt-deceit

Key Contributions

MLE-Sabotage: 20 hand-crafted code-sabotage tasks extending MLE-Bench, including backdoor implantation and generalization failure induction, evaluated on frontier LM agents
Empirical demonstration that frontier agents can sandbag to specified target performance levels while partially evading LM-based monitors
Evaluation of LM monitors for detecting sabotage and sandbagging, finding monitors effective against code-sabotage but unreliable against sandbagging

🛡️ Threat Analysis

Model Poisoning

A core contribution is the MLE-Sabotage benchmark with 20 tasks including explicitly 'implanting a backdoor into a model' and 'inducing a generalisation failure' — these are canonical backdoor/trojan attacks where frontier agents hide malicious behavior in the resulting ML models.

Details

Domains

nlp

Model Types

llm

Threat Tags

white_boxtraining_timetargeted

Datasets

MLE-BenchMLE-Sabotage

Applications

automated ml r&dai-assisted software engineeringml pipeline security

Read PDF arXiv DOI Code

CTRL-ALT-DECEIT: Sabotage Evaluations for Automated AI R&D

Key Contributions

🛡️ Threat Analysis

Details

Similar Papers

Backdooring Bias in Large Language Models

BackdoorAgent: A Unified Framework for Backdoor Attacks on LLM-based Agents

Targeted Bit-Flip Attacks on LLM-Based Agents

Triggers Hijack Language Circuits: A Mechanistic Analysis of Backdoor Behaviors in Large Language Models

When Evaluation Becomes a Side Channel: Regime Leakage and Structural Mitigations for Alignment Assessment

Mechanistic Exploration of Backdoored Large Language Model Attention Patterns

Signature in Code Backdoor Detection, how far are we?

A Systematic Evaluation of Parameter-Efficient Fine-Tuning Methods for the Security of Code LLMs