Bypassing AI Control Protocols via Agent-as-a-Proxy Attacks
Jafar Isbarov, Murat Kantarcioglu
Published on arXiv (2602.05066)
Input Manipulation Attack
OWASP ML Top 10 — ML01
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
Achieves >90% attack success rate against AlignmentCheck and Extract-and-Evaluate monitors using GPT-4o mini and Llama-3.1-70B agents, bypassing even frontier-scale Qwen2.5-72B monitors.
Parallel-GCG / Agent-as-a-Proxy
Novel technique introduced
As AI agents automate critical workloads, they remain vulnerable to indirect prompt injection (IPI) attacks. Current defenses rely on monitoring protocols that jointly evaluate an agent's Chain-of-Thought (CoT) and tool-use actions to ensure alignment with user intent. We demonstrate that these monitoring-based defenses can be bypassed via a novel Agent-as-a-Proxy attack, in which an injected prompt treats the agent as a delivery mechanism, subverting both the agent and its monitor simultaneously. While prior work on scalable oversight has asked whether small monitors can supervise large agents, we show that even frontier-scale monitors are vulnerable: large monitoring models such as Qwen2.5-72B can be bypassed by agents of comparable or smaller capability, including GPT-4o mini and Llama-3.1-70B. On the AgentDojo benchmark, we achieve a high attack success rate against AlignmentCheck and Extract-and-Evaluate monitors across diverse monitoring LLMs. Our findings suggest that current monitoring-based agentic defenses are fundamentally fragile regardless of model scale.
Key Contributions
- Agent-as-a-Proxy attack framework that coerces the victim LLM agent to echo adversarial strings in its CoT and tool parameters, weaponizing the agent as a delivery mechanism against its own monitor.
- Parallel-GCG optimization algorithm that generates adversarial strings robust across multiple distinct contexts (reasoning traces, function arguments, execution logs) observed by a monitor during an agentic trajectory.
- Empirical demonstration that frontier-scale hybrid monitors (Qwen2.5-72B) are bypassed with >90% ASR on AgentDojo, invalidating the model-scale safety hypothesis for monitoring-based defenses.
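To illustrate the multi-context objective behind Parallel-GCG, the toy sketch below uses GCG-style per-position candidate sweeps, but replaces the real gradient-guided, LLM-based loss with a hypothetical string-matching loss so the example is self-contained and runnable. The `CONTEXTS` targets, `context_loss`, and `parallel_gcg` names are illustrative assumptions, not the paper's implementation; the core idea shown is aggregating the loss over every context the monitor observes so a single suffix stays adversarial in all of them.

```python
VOCAB = "abcdefghijklmnopqrstuvwxyz"

# Hypothetical stand-ins for the distinct contexts a monitor sees during
# one agentic trajectory: reasoning trace, tool argument, execution log.
CONTEXTS = ["systemrun", "systemlog", "sysadmins"]

def context_loss(suffix: str, target: str) -> int:
    """Stand-in loss for one context: character mismatches vs. its target
    (the real attack would use the monitor's token-level log-probabilities)."""
    return sum(a != b for a, b in zip(suffix, target))

def total_loss(suffix: str) -> int:
    # Parallel-GCG's key idea: sum the loss over *all* contexts, so the
    # optimized suffix remains adversarial everywhere the monitor looks.
    return sum(context_loss(suffix, t) for t in CONTEXTS)

def parallel_gcg(n_iters: int = 3) -> str:
    suffix = "a" * len(CONTEXTS[0])
    for _ in range(n_iters):
        for pos in range(len(suffix)):
            # GCG-style coordinate sweep: try every candidate token at this
            # position and keep the one minimizing the aggregate loss.
            best = min(VOCAB,
                       key=lambda c: total_loss(suffix[:pos] + c + suffix[pos + 1:]))
            suffix = suffix[:pos] + best + suffix[pos + 1:]
    return suffix
```

The real algorithm scores candidate token substitutions with gradients of the monitor's loss rather than exhaustive search, but the acceptance criterion sketched here, improvement of the loss summed across contexts, is what distinguishes the "parallel" variant from single-context GCG.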
🛡️ Threat Analysis
Parallel-GCG is a gradient-based adversarial suffix optimization algorithm (a GCG variant) that crafts token-level perturbations to subvert LLM monitors — a direct instance of an ML01-style adversarial input attack applied to LLMs.