Reliable Weak-to-Strong Monitoring of LLM Agents
Neil Kale 1,2, Chen Bo Calvin Zhang 1, Kevin Zhu 1,3, Ankit Aich 1, Paula Rodriguez 1, Scale Red Team 1, Christina Q. Knight 1, Zifan Wang 1
Published on arXiv (2508.19461)
Topics: Excessive Agency (OWASP LLM Top 10, LLM08); Prompt Injection (OWASP LLM Top 10, LLM01)
Key Finding
Hybrid hierarchical-sequential monitor scaffolding enables weak-to-strong monitoring, and targeted human escalation of pre-flagged cases improves true positive rate by ~15% at FPR=0.01; agent awareness of monitoring degrades monitor reliability more than additional monitor context helps it.
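The finding above quotes true positive rate at a fixed false positive rate (FPR = 0.01). A minimal sketch of how that metric is computed from monitor suspicion scores; the function name and the thresholding rule are illustrative, not taken from the paper:

```python
def tpr_at_fpr(scores, labels, target_fpr=0.01):
    """TPR at the largest threshold whose FPR stays <= target_fpr.

    scores: monitor suspicion scores (higher = more suspicious).
    labels: 1 = covert misbehavior, 0 = benign trajectory.
    """
    # Negative-class scores, highest first.
    neg = sorted((s for s, y in zip(scores, labels) if y == 0), reverse=True)
    # Maximum number of false positives allowed at the target FPR.
    k = int(target_fpr * len(neg))
    # Flag everything strictly above the (k+1)-th highest negative score,
    # so at most k negatives are flagged and FPR <= target_fpr.
    thresh = neg[k]
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s > thresh)
    pos = sum(labels)
    return tp / pos
```

At FPR = 0.01 the threshold is strict, so a ~15% TPR gain from human escalation is substantial: it recovers misbehavior the monitor alone ranks below the cutoff.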
Monitor Red Teaming (MRT): novel technique introduced
We stress test monitoring systems for detecting covert misbehavior in autonomous LLM agents (e.g., secretly sharing private information). To this end, we systematize a monitor red teaming (MRT) workflow that incorporates: (1) varying levels of agent and monitor situational awareness; (2) distinct adversarial strategies to evade the monitor, such as prompt injection; and (3) two datasets and environments -- SHADE-Arena for tool-calling agents and our new CUA-SHADE-Arena, which extends TheAgentCompany, for computer-use agents. We run MRT on existing LLM monitor scaffoldings, which orchestrate LLMs and parse agent trajectories, alongside a new hybrid hierarchical-sequential scaffolding proposed in this work. Our empirical results yield three key findings. First, agent awareness dominates monitor awareness: an agent's knowledge that it is being monitored substantially degrades the monitor's reliability, whereas providing the monitor with more information about the agent helps less than expected. Second, monitor scaffolding matters more than monitor awareness: the hybrid scaffolding consistently outperforms baseline monitor scaffolding and can enable weaker models to reliably monitor stronger agents -- a weak-to-strong scaling effect. Third, in a human-in-the-loop setting where humans confer with the LLM monitor to reach an updated judgment of the agent's behavior, targeted human oversight is most effective: escalating only pre-flagged cases to human reviewers improves the TPR by approximately 15% at FPR = 0.01. Our work establishes a standard workflow for MRT, highlighting the limited adversarial robustness of both LLM and human monitors in detecting agent misbehavior. We release code, data, and logs to spur further research.
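The abstract describes monitor scaffoldings that "orchestrate LLMs and parse agent trajectories" but does not specify the hybrid hierarchical-sequential design. An illustrative sketch of the general idea, under our own assumptions: `judge` stands in for an LLM call returning a suspicion score in [0, 1], the chunking and max-aggregation choices are ours, and this is not the paper's exact algorithm:

```python
from typing import Callable, List

def hybrid_monitor(trajectory: List[str],
                   judge: Callable[[str], float],
                   chunk_size: int = 4) -> float:
    """Combine two views of an agent trajectory (illustrative sketch).

    Hierarchical pass: judge each chunk of actions independently.
    Sequential pass: judge chunks in order, carrying prior context forward.
    """
    chunks = [trajectory[i:i + chunk_size]
              for i in range(0, len(trajectory), chunk_size)]
    # Hierarchical: independent per-chunk judgments, aggregated by max,
    # so a single suspicious chunk cannot be diluted by benign ones.
    hierarchical_score = max(judge("\n".join(c)) for c in chunks)
    # Sequential: each judgment sees the current chunk plus a note about
    # prior suspicion, catching misbehavior spread across chunks.
    running, context = 0.0, ""
    for c in chunks:
        running = judge(context + "\n".join(c))
        context += f"[prior suspicion: {running:.2f}]\n"
    return max(hierarchical_score, running)
```

Because both passes use cheap, local LLM calls, a weaker model can serve as `judge` over a stronger agent's trajectory, which is one plausible reading of the weak-to-strong scaling effect the paper reports.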
Key Contributions
- A systematic Monitor Red Teaming (MRT) workflow that stress-tests LLM agent monitors under varying agent/monitor awareness levels and adversarial evasion strategies including prompt injection
- A hybrid hierarchical-sequential monitor scaffolding that enables weaker LLMs to reliably monitor stronger agents (weak-to-strong scaling effect), outperforming baseline scaffolding architectures
- Human-in-the-loop analysis showing that targeted escalation of pre-flagged cases to human reviewers improves TPR by ~15% at FPR=0.01 compared to untargeted oversight
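The targeted-escalation protocol in the last contribution can be sketched as follows. `human_review` is a hypothetical stand-in for the reviewer's verdict, not the paper's interface; the point is that human effort is spent only where the monitor already flagged suspicion:

```python
from typing import Callable, List

def targeted_escalation(scores: List[float],
                        flag_threshold: float,
                        human_review: Callable[[int], bool]) -> List[bool]:
    """Escalate only monitor-flagged trajectories to a human reviewer.

    scores: monitor suspicion scores per trajectory.
    Returns the final misbehavior verdicts: the human decides flagged
    cases; unflagged cases keep the monitor's negative decision.
    """
    verdicts = []
    for i, s in enumerate(scores):
        if s >= flag_threshold:
            verdicts.append(human_review(i))  # human decides flagged cases
        else:
            verdicts.append(False)  # trust the monitor's pass
    return verdicts
```

Concentrating review on pre-flagged cases is what makes the oversight "targeted": with a low flag threshold the human sees few cases, so the reported ~15% TPR gain at FPR = 0.01 comes at a small review cost.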