defense 2025

Async Control: Stress-testing Asynchronous Control Measures for LLM Agents

Asa Cooper Stickland , Jan Michelfeit , Arathi Mani , Charlie Griffin , Ollie Matthews , Tomek Korbak , Rogan Inglis , Oliver Makins , Alan Cooney

UK AI Security Institute

1 citation · 28 references · arXiv

Published on arXiv

2512.13526

Excessive Agency

OWASP LLM Top 10 — LLM08

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

The ensemble monitor achieves a 6% false negative rate at a 1% false positive rate on a held-out software engineering environment. Monitor performance improves across red-blue game rounds, and red-team attacks fail to significantly degrade the round-3 monitor's performance.

Async Control (Asynchronous Ensemble Monitor)

Novel technique introduced

LLM-based software engineering agents are increasingly used in real-world development tasks, often with access to sensitive data or security-critical codebases. Such agents could intentionally sabotage these codebases if they were misaligned. We investigate asynchronous monitoring, in which a monitoring system reviews agent actions after the fact. Unlike synchronous monitoring, this approach does not impose runtime latency, while still attempting to disrupt attacks before irreversible harm occurs. We treat monitor development as an adversarial game between a blue team (who design monitors) and a red team (who create sabotaging agents). We attempt to set the game rules such that they upper-bound the sabotage potential of an agent based on Claude 4.1 Opus. To ground this game in a realistic, high-stakes deployment scenario, we develop a suite of 5 diverse software engineering environments that simulate tasks that an agent might perform within an AI developer's internal infrastructure. Over the course of the game, we develop an ensemble monitor that achieves a 6% false negative rate at a 1% false positive rate on a held-out test environment. We then estimate the risk of sabotage at deployment time by extrapolating from our monitor's false negative rate. We describe one simple model for this extrapolation, present a sensitivity analysis, and describe situations in which the model would be invalid. Code is available at: https://github.com/UKGovernmentBEIS/async-control.
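The headline metric (false negative rate at a fixed false positive rate) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `ensemble_score` and `fnr_at_fpr`, the use of a simple mean to combine monitors, and the thresholding rule are all assumptions made for illustration. The idea is to pick the lowest threshold that flags at most the target fraction of benign trajectories, then count how many attack trajectories fall at or below it.

```python
import numpy as np


def ensemble_score(monitor_scores: np.ndarray) -> np.ndarray:
    """Combine per-monitor suspicion scores by averaging.

    monitor_scores has shape (n_trajectories, n_monitors).
    Averaging is an assumption; an ensemble could combine scores differently.
    """
    return monitor_scores.mean(axis=1)


def fnr_at_fpr(benign: np.ndarray, attack: np.ndarray, target_fpr: float = 0.01):
    """Return (threshold, fpr, fnr) for a monitor that flags scores > threshold.

    The threshold is chosen so that at most target_fpr of benign scores
    are flagged (strictly above the threshold).
    """
    benign_sorted = np.sort(benign)
    n = len(benign_sorted)
    allowed = int(np.floor(target_fpr * n))   # benign runs we may flag
    threshold = benign_sorted[n - allowed - 1]
    fpr = float(np.mean(benign > threshold))  # benign runs wrongly flagged
    fnr = float(np.mean(attack <= threshold)) # attack runs that slip through
    return threshold, fpr, fnr


if __name__ == "__main__":
    # Synthetic scores only; these are not results from the paper.
    rng = np.random.default_rng(0)
    benign = ensemble_score(rng.normal(0.0, 1.0, size=(1000, 3)))
    attack = ensemble_score(rng.normal(3.0, 1.0, size=(200, 3)))
    threshold, fpr, fnr = fnr_at_fpr(benign, attack, target_fpr=0.01)
    print(f"threshold={threshold:.3f} fpr={fpr:.3f} fnr={fnr:.3f}")
```

Choosing the threshold from benign data alone mirrors how a deployment would fix an audit budget first and then measure what fraction of attacks evades it.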

Key Contributions

A suite of 5 diverse software engineering environments simulating high-stakes tasks within an AI developer's infrastructure, each with realistic benign tasks and associated sabotage side tasks (secret exfiltration, security policy weakening, etc.)
A structured red-blue team game methodology for stress-testing asynchronous monitors under realistic conditions, with explicit rules that upper-bound the sabotage potential of current frontier models
An ensemble monitor achieving a 6% false negative rate at a 1% false positive rate on a held-out test environment, plus a deployment risk extrapolation model with sensitivity analysis
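To make the idea of extrapolating deployment risk from a false negative rate concrete, here is a toy sketch. It is not the model described in the paper, whose details are not given in this summary. It assumes, purely for illustration, that sabotage requires a fixed number of attack actions, that each is missed by the monitor independently with probability equal to the false negative rate, and that a single detection stops the agent. The names `prob_all_undetected` and `sensitivity_table` are hypothetical.

```python
def prob_all_undetected(fnr: float, attacks_needed: int) -> float:
    """Probability that every required attack action evades the monitor.

    Assumes independent misses with probability fnr per action
    (an illustrative assumption, not the paper's model).
    """
    if not 0.0 <= fnr <= 1.0:
        raise ValueError("fnr must be in [0, 1]")
    if attacks_needed < 1:
        raise ValueError("attacks_needed must be at least 1")
    return fnr ** attacks_needed


def sensitivity_table(fnrs, attacks_needed_values):
    """Evaluate the toy model over a grid of assumed parameter values."""
    return {
        (fnr, k): prob_all_undetected(fnr, k)
        for fnr in fnrs
        for k in attacks_needed_values
    }


if __name__ == "__main__":
    # 0.06 is the reported false negative rate; the other grid values
    # are arbitrary points chosen to show sensitivity.
    table = sensitivity_table([0.03, 0.06, 0.12], [1, 2, 3])
    for (fnr, k), p in sorted(table.items()):
        print(f"fnr={fnr:.2f} attacks_needed={k} p_undetected={p:.6f}")
```

The independence assumption is the fragile part of this sketch: if a monitor misses attacks for a shared reason, such as one evasion strategy that works every time, the per-action misses are correlated and the product underestimates risk.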

🛡️ Threat Analysis

Details

Domains

nlp

Model Types

llm

Threat Tags

inference_time, grey_box, targeted

Datasets

vLLM environment, Kubernetes environment, Infrastructure as Code (IaC) environment, custom SWE agent benchmark suite (5 environments)

Applications

software engineering agents, AI developer internal infrastructure, security-critical codebases

Similar Papers

Factor(U,T): Controlling Untrusted AI by Monitoring their Plans

Verifiability-First Agents: Provable Observability and Lightweight Audit Agents for Controlling Autonomous LLM Systems

From Spark to Fire: Modeling and Mitigating Error Cascades in LLM-Based Multi-Agent Collaboration

AgentSys: Secure and Dynamic LLM Agents Through Explicit Hierarchical Memory Management

WebWeaver: Breaking Topology Confidentiality in LLM Multi-Agent Systems with Stealthy Context-Based Inference

Adaptive Attacks on Trusted Monitors Subvert AI Control Protocols

Breaking and Fixing Defenses Against Control-Flow Hijacking in Multi-Agent Systems

A-MemGuard: A Proactive Defense Framework for LLM-Based Agent Memory