Defense · 2025

Factor(U,T): Controlling Untrusted AI by Monitoring their Plans

Edward Lue Chee Lip 1,2, Anthony Channg 1, Diana Kim 1,3, Aaron Sandoval 1, Kevin Zhu 1

0 citations · 15 references · arXiv

Published on arXiv · 2512.14745

Excessive Agency

OWASP LLM Top 10 — LLM08

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

Decomposition-only monitoring achieves AUROC 0.52 (near random) vs. 0.96 for full-implementation monitoring; Factor(D,U) with implementation-context monitoring achieves 1.2% attack success rate.

Factor(U,T)

Novel technique introduced


As AI capabilities advance, we increasingly rely on powerful models to decompose complex tasks – but what if the decomposer itself is malicious? Factored cognition protocols decompose complex tasks into simpler child tasks: one model creates the decomposition, while other models implement the child tasks in isolation. Prior work uses trusted (weaker but reliable) models for decomposition, which limits usefulness for tasks where decomposition itself is challenging. We introduce Factor($U$,$T$), in which an untrusted (stronger but potentially malicious) model decomposes while trusted models implement child tasks. Can monitors detect malicious activity when observing only natural language task instructions, rather than complete solutions? We baseline and red team Factor($U$,$T$) in control evaluations on BigCodeBench, a dataset of Python coding tasks. Monitors distinguishing malicious from honest decompositions perform poorly (AUROC 0.52) compared to monitors evaluating complete Python solutions (AUROC 0.96). Furthermore, Factor($D$,$U$), which uses a trusted decomposer and monitors concrete child solutions, achieves excellent discrimination (AUROC 0.96) and strong safety (1.2% ASR), demonstrating that implementation-context monitoring succeeds where decomposition-only monitoring fails.
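The control flow of the protocol described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: all model and monitor functions (`untrusted_decompose`, `trusted_implement`, `monitor_decomposition`) are hypothetical stand-ins, and the threshold-based flagging is an assumed audit policy.

```python
# Sketch of the Factor(U,T) control protocol: an untrusted model decomposes,
# a monitor scores the decomposition, and trusted models implement child
# tasks in isolation. All functions are illustrative stubs.
from dataclasses import dataclass

@dataclass
class ChildTask:
    instruction: str  # natural-language instruction; all the monitor sees

def untrusted_decompose(task: str) -> list:
    # Stand-in for the untrusted (stronger, possibly malicious) decomposer.
    return [ChildTask(f"step {i}: subtask of '{task}'") for i in range(3)]

def trusted_implement(child: ChildTask) -> str:
    # Stand-in for a trusted model implementing one child task in isolation.
    return f"# solution for: {child.instruction}"

def monitor_decomposition(children: list) -> float:
    # Stand-in suspicion score over instructions alone (0 = looks benign).
    # The paper finds such decomposition-only monitors perform near chance.
    return 0.0

def factor_ut(task: str, threshold: float = 0.5):
    """Run one episode; return child solutions, or None if flagged."""
    children = untrusted_decompose(task)
    if monitor_decomposition(children) > threshold:
        return None  # route to audit instead of executing
    return [trusted_implement(c) for c in children]

solutions = factor_ut("parse a CSV file and plot column means")
```

The key structural point this sketch makes concrete is that the monitor only ever sees the `instruction` strings, never the eventual code; the paper's finding is that this view is too abstract to reveal malicious intent.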


Key Contributions

  • Introduces Factor(U,T), a factored cognition control protocol where an untrusted (stronger) model decomposes tasks and trusted models implement child tasks in isolation, and red-teams it against malicious decomposers.
  • Demonstrates that decomposition-only monitoring is fundamentally limited (AUROC 0.52) because monitors cannot detect malicious intent from abstract plans without implementation context.
  • Shows that Factor(D,U) — monitoring concrete child implementations rather than abstract decompositions — achieves strong safety (1.2% ASR) and high discrimination (AUROC 0.96), establishing that implementation-context monitoring is necessary.
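The contributions above hinge on AUROC as the measure of a monitor's discrimination. As a reminder of what the 0.52 vs. 0.96 gap means, here is a self-contained sketch of AUROC as the probability that a randomly chosen malicious sample receives a higher suspicion score than a randomly chosen honest one; the scores and labels are illustrative, not the paper's data.

```python
# AUROC via pairwise comparison: the fraction of (malicious, honest) pairs
# where the malicious sample scores higher, counting ties as half.
def auroc(scores, labels):
    pos = [s for s, y in zip(scores, labels) if y == 1]  # malicious
    neg = [s for s, y in zip(scores, labels) if y == 0]  # honest
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [1, 1, 0, 0]
# A monitor whose scores barely separate the classes sits near 0.5 (chance);
# one that cleanly separates them approaches 1.0.
weak   = auroc([0.5, 0.4, 0.5, 0.4], labels)  # 0.5: no discrimination
strong = auroc([0.9, 0.8, 0.2, 0.1], labels)  # 1.0: perfect separation
```

On this scale, the decomposition-only monitor's 0.52 is effectively indistinguishable from flipping a coin, while 0.96 means almost every malicious/honest pair is ranked correctly.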

🛡️ Threat Analysis


Details

Domains
nlp
Model Types
llm
Threat Tags
grey_box, inference_time
Datasets
BigCodeBench
Applications
multi-agent coding systems, llm agent pipelines, factored cognition protocols