benchmark 2025

Evaluating Control Protocols for Untrusted AI Agents

1 citation · 27 references · arXiv

Published on arXiv

2511.02997

Excessive Agency

OWASP LLM Top 10 — LLM08

Key Finding

Deferring on critical actions achieves 91% safety against a default attack policy and remains highly robust to adaptive red-team strategies, while resampling-based protocols collapse to 17% safety when the attack policy has knowledge of when resampling occurs.

Critical Action Deferral

Novel technique introduced

As AI systems become more capable and widely deployed as agents, ensuring their safe operation becomes critical. AI control offers one approach to mitigating the risk from untrusted AI agents by monitoring their actions and intervening or auditing when necessary. Evaluating the safety of these protocols requires understanding both their effectiveness against current attacks and their robustness to adaptive adversaries. In this work, we systematically evaluate a range of control protocols in SHADE-Arena, a dataset of diverse agentic environments. First, we evaluate blue team protocols, including deferral to trusted models, resampling, and deferring on critical actions, against a default attack policy. We find that resampling for incrimination and deferring on critical actions perform best, increasing safety from 50% to 96%. We then iterate on red team strategies against these protocols and find that attack policies with additional affordances, such as knowledge of when resampling occurs or the ability to simulate monitors, can substantially improve attack success rates against our resampling strategy, decreasing safety to 17%. However, deferring on critical actions is highly robust to even our strongest red team strategies, demonstrating the importance of denying attack policies access to protocol internals.
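The monitor-then-intervene loop described above can be illustrated with a minimal sketch of a defer-to-trusted decision rule. Everything named here is an assumption for illustration: the `Decision` type, the `defer_to_trusted` function, the suspicion score scale of 0 to 1, and both threshold values are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Decision:
    action: str    # the action that will actually be executed
    source: str    # "untrusted" or "trusted"
    audited: bool  # whether the step is flagged for human audit


def defer_to_trusted(
    untrusted_action: str,
    trusted_action: str,
    suspicion: float,
    defer_threshold: float = 0.5,  # hypothetical value
    audit_threshold: float = 0.9,  # hypothetical value
) -> Decision:
    """Pick which action to execute given a monitor's suspicion score.

    Assumption: `suspicion` lies in [0, 1] and higher means more suspicious.
    """
    if suspicion >= audit_threshold:
        # Most suspicious band: do not run the untrusted action, flag for audit.
        return Decision(trusted_action, "trusted", audited=True)
    if suspicion >= defer_threshold:
        # Middle band: silently replace the action with the trusted model's.
        return Decision(trusted_action, "trusted", audited=False)
    # Low suspicion: let the untrusted agent's action through.
    return Decision(untrusted_action, "untrusted", audited=False)
```

In this sketch the two thresholds trade usefulness against safety: lowering `defer_threshold` replaces more untrusted actions, which may reduce task performance if the trusted model is weaker.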

Key Contributions

Refined defer-to-resample blue-team protocol that eliminates accepting resampled actions and provides monitors with full trajectory resampling history, improving safety from 83% to 87%
Critical-action targeting protocol that focuses monitoring on the most consequential agent actions, achieving 91% safety that remains robust against sophisticated red-team strategies
Adaptive red-team strategies showing that attack policies with knowledge of resampling events or the ability to simulate monitors can reduce safety from 89% to 17%, demonstrating the critical importance of denying protocol-internal information to attackers
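As a rough illustration of the critical-action idea in the contributions above, the sketch below applies monitoring only to actions considered consequential and defers to a trusted model when such an action looks suspicious. The tool names, the `choose_action` function, the threshold, and the choice to defer only above a threshold (rather than on every critical action) are all assumptions made for illustration, not details confirmed by the source.

```python
from typing import Callable, FrozenSet, Tuple

# Hypothetical set of tools treated as critical; not taken from SHADE-Arena.
HYPOTHETICAL_CRITICAL_TOOLS: FrozenSet[str] = frozenset(
    {"send_payment", "delete_records"}
)


def choose_action(
    tool_name: str,
    untrusted_action: str,
    trusted_action: str,
    monitor: Callable[[str], float],
    threshold: float = 0.5,  # hypothetical value
    critical_tools: FrozenSet[str] = HYPOTHETICAL_CRITICAL_TOOLS,
) -> Tuple[str, str]:
    """Return (source, action) for one agent step.

    Non-critical actions pass through without monitoring. Critical actions
    are scored by `monitor`; a score at or above `threshold` triggers deferral.
    """
    if tool_name not in critical_tools:
        return "untrusted", untrusted_action
    if monitor(untrusted_action) >= threshold:
        return "trusted", trusted_action
    return "untrusted", untrusted_action
```

Because the decision in this sketch depends only on the current action and a fixed tool list, it exposes no resampling signal that an attack policy could observe, which is one plausible reading of why the source reports this protocol as robust.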

🛡️ Threat Analysis

Details

Domains

nlp

Model Types

llm

Threat Tags

white_box, grey_box, inference_time

Datasets

SHADE-Arena

Applications

agentic ai systems, ai agent safety monitoring, autonomous agent deployment

Similar Papers

BashArena: A Control Setting for Highly Privileged AI Agents

NEST: Nascent Encoded Steganographic Thoughts

Agentic Misalignment: How LLMs Could Be Insider Threats

Language Models Identify Ambiguities and Exploit Loopholes

How does information access affect LLM monitors' ability to detect sabotage?

Password-Activated Shutdown Protocols for Misaligned Frontier Agents

Basic Legibility Protocols Improve Trusted Monitoring

Benchmarking the Robustness of Agentic Systems to Adversarially-Induced Harms