Defense · 2025

Factor(U,T): Controlling Untrusted AI by Monitoring their Plans

Edward Lue Chee Lip 1,2, Anthony Channg 1, Diana Kim 1,3, Aaron Sandoval 1, Kevin Zhu 1

0 citations · 15 references · arXiv

Published on arXiv · 2512.14745

Excessive Agency

OWASP LLM Top 10 — LLM08

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

Decomposition-only monitoring achieves AUROC 0.52 (near random) vs. 0.96 for full-implementation monitoring; Factor(D,U) with implementation-context monitoring achieves 1.2% attack success rate.

Factor(U,T)

Novel technique introduced


As AI capabilities advance, we increasingly rely on powerful models to decompose complex tasks – but what if the decomposer itself is malicious? Factored cognition protocols decompose complex tasks into simpler child tasks: one model creates the decomposition, while other models implement the child tasks in isolation. Prior work uses trusted (weaker but reliable) models for decomposition, which limits usefulness for tasks where decomposition itself is challenging. We introduce Factor($U$,$T$), in which an untrusted (stronger but potentially malicious) model decomposes while trusted models implement child tasks. Can monitors detect malicious activity when observing only natural language task instructions, rather than complete solutions? We baseline and red team Factor($U$,$T$) in control evaluations on BigCodeBench, a dataset of Python coding tasks. Monitors distinguishing malicious from honest decompositions perform poorly (AUROC 0.52) compared to monitors evaluating complete Python solutions (AUROC 0.96). Furthermore, Factor($D$,$U$), which uses a trusted decomposer and monitors concrete child solutions, achieves excellent discrimination (AUROC 0.96) and strong safety (1.2% ASR), demonstrating that implementation-context monitoring succeeds where decomposition-only monitoring fails.
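The control flow of the protocol described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: all model and monitor functions (`untrusted_decompose`, `trusted_implement`, `monitor_decomposition`) are hypothetical stand-ins, and the threshold-based flagging is an assumed audit policy.

```python
# Sketch of the Factor(U,T) control protocol: an untrusted model decomposes,
# a monitor scores the decomposition, and trusted models implement child
# tasks in isolation. All functions are illustrative stubs.
from dataclasses import dataclass

@dataclass
class ChildTask:
    instruction: str  # natural-language instruction; all the monitor sees

def untrusted_decompose(task: str) -> list:
    # Stand-in for the untrusted (stronger, possibly malicious) decomposer.
    return [ChildTask(f"step {i}: subtask of '{task}'") for i in range(3)]

def trusted_implement(child: ChildTask) -> str:
    # Stand-in for a trusted model implementing one child task in isolation.
    return f"# solution for: {child.instruction}"

def monitor_decomposition(children: list) -> float:
    # Stand-in suspicion score over instructions alone (0 = looks benign).
    # The paper finds such decomposition-only monitors perform near chance.
    return 0.0

def factor_ut(task: str, threshold: float = 0.5):
    """Run one episode; return child solutions, or None if flagged."""
    children = untrusted_decompose(task)
    if monitor_decomposition(children) > threshold:
        return None  # route to audit instead of executing
    return [trusted_implement(c) for c in children]

solutions = factor_ut("parse a CSV file and plot column means")
```

The key structural point this sketch makes concrete is that the monitor only ever sees the `instruction` strings, never the eventual code; the paper's finding is that this view is too abstract to reveal malicious intent.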


Key Contributions

  • Introduces Factor(U,T), a factored cognition control protocol where an untrusted (stronger) model decomposes tasks and trusted models implement child tasks in isolation, and red-teams it against malicious decomposers.
  • Demonstrates that decomposition-only monitoring is fundamentally limited (AUROC 0.52) because monitors cannot detect malicious intent from abstract plans without implementation context.
  • Shows that Factor(D,U) — monitoring concrete child implementations rather than abstract decompositions — achieves strong safety (1.2% ASR) and high discrimination (AUROC 0.96), establishing that implementation-context monitoring is necessary.
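The contributions above hinge on AUROC as the measure of a monitor's discrimination. As a reminder of what the 0.52 vs. 0.96 gap means, here is a self-contained sketch of AUROC as the probability that a randomly chosen malicious sample receives a higher suspicion score than a randomly chosen honest one; the scores and labels are illustrative, not the paper's data.

```python
# AUROC via pairwise comparison: the fraction of (malicious, honest) pairs
# where the malicious sample scores higher, counting ties as half.
def auroc(scores, labels):
    pos = [s for s, y in zip(scores, labels) if y == 1]  # malicious
    neg = [s for s, y in zip(scores, labels) if y == 0]  # honest
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [1, 1, 0, 0]
# A monitor whose scores barely separate the classes sits near 0.5 (chance);
# one that cleanly separates them approaches 1.0.
weak   = auroc([0.5, 0.4, 0.5, 0.4], labels)  # 0.5: no discrimination
strong = auroc([0.9, 0.8, 0.2, 0.1], labels)  # 1.0: perfect separation
```

On this scale, the decomposition-only monitor's 0.52 is effectively indistinguishable from flipping a coin, while 0.96 means almost every malicious/honest pair is ranked correctly.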

🛡️ Threat Analysis


Details

Domains
nlp
Model Types
llm
Threat Tags
grey_box, inference_time
Datasets
BigCodeBench
Applications
multi-agent coding systems, llm agent pipelines, factored cognition protocols