benchmark 2026

How Vulnerable Are AI Agents to Indirect Prompt Injections? Insights from a Large-Scale Public Competition

Mateusz Dziemian ¹, Maxwell Lin ¹, Xiaohan Fu ¹, Micha Nowak ¹, Nick Winter ², Eliot Jones ², Andy Zou ^3,4, Lama Ahmad ³, Kamalika Chaudhuri ³, Sahana Chennabasappa ², Xander Davies ⁵, Lauren Deason ⁵, Benjamin L. Edelman ⁶, Tanner Emek ⁵, Ivan Evtimov ⁷, Jim Gust ⁸, Maia Hamin ⁵, Kat He ⁵, Klaudia Krawiecka ⁷, Riccardo Patana ⁵, Neil Perry ⁵, Troy Peterson ⁸, Xiangyu Qi ⁷, Javier Rando ², Zifan Wang ², Zihan Wang ⁸, Spencer Whitman ⁵, Eric Winsor ⁵, Arman Zharmagambetov ⁵, Matt Fredrikson ⁶, Zico Kolter ⁵

¹ Gray Swan AI

² OpenAI

³ Carnegie Mellon University

⁴ Center for AI Safety

⁵ Meta

⁶ UK AISI

⁷ US CAISI

⁸ Anthropic

0 citations

Published on arXiv

2603.15714

Prompt Injection

OWASP LLM Top 10 — LLM01

Excessive Agency

OWASP LLM Top 10 — LLM08

Key Finding

All 13 frontier models vulnerable to indirect prompt injection with success rates from 0.5% (Claude Opus 4.5) to 8.5% (Gemini 2.5 Pro); weak correlation between capability and robustness

LLM based agents are increasingly deployed in high stakes settings where they process external data sources such as emails, documents, and code repositories. This creates exposure to indirect prompt injection attacks, where adversarial instructions embedded in external content manipulate agent behavior without user awareness. A critical but underexplored dimension of this threat is concealment: since users tend to observe only an agent's final response, an attack can conceal its existence by presenting no clue of compromise in the final user facing response while successfully executing harmful actions. This leaves users unaware of the manipulation and likely to accept harmful outcomes as legitimate. We present findings from a large scale public red teaming competition evaluating this dual objective across three agent settings: tool calling, coding, and computer use. The competition attracted 464 participants who submitted 272000 attack attempts against 13 frontier models, yielding 8648 successful attacks across 41 scenarios. All models proved vulnerable, with attack success rates ranging from 0.5% (Claude Opus 4.5) to 8.5% (Gemini 2.5 Pro). We identify universal attack strategies that transfer across 21 of 41 behaviors and multiple model families, suggesting fundamental weaknesses in instruction following architectures. Capability and robustness showed weak correlation, with Gemini 2.5 Pro exhibiting both high capability and high vulnerability. To address benchmark saturation and obsoleteness, we will endeavor to deliver quarterly updates through continued red teaming competitions. We open source the competition environment for use in evaluations, along with 95 successful attacks against Qwen that did not transfer to any closed source model. We share model-specific attack data with respective frontier labs and the full dataset with the UK AISI and US CAISI to support robustness research.

Key Contributions

Large-scale public red teaming competition with 464 participants, 272,000 attack attempts, and 8,648 successful attacks across 13 frontier models and 41 scenarios
Identification of universal attack strategies transferring across 21 of 41 behaviors and multiple model families, suggesting fundamental weaknesses in instruction-following architectures
Open-sourced competition environment and 95 successful attacks against Qwen; shared model-specific data with frontier labs and full dataset with UK AISI and US CAISI

🛡️ Threat Analysis

Details

Domains

nlpmultimodal

Model Types

llm

Threat Tags

black_boxinference_timetargeted

Datasets

41 agent scenarios272,000 attack attempts8,648 successful attacks

Applications

llm agentstool callingcode generationcomputer use automation

Read PDF arXiv

How Vulnerable Are AI Agents to Indirect Prompt Injections? Insights from a Large-Scale Public Competition

Key Contributions

🛡️ Threat Analysis

Details

Similar Papers

ClawSafety: "Safe" LLMs, Unsafe Agents

Beyond Model Jailbreak: Systematic Dissection of the "Ten DeadlySins" in Embodied Intelligence

$α^3$-SecBench: A Large-Scale Evaluation Suite of Security, Resilience, and Trust for LLM-based UAV Agents over 6G Networks

SecureWebArena: A Holistic Security Evaluation Benchmark for LVLM-based Web Agents

ScamAgents: How AI Agents Can Simulate Human-Level Scam Calls

Code Agent can be an End-to-end System Hacker: Benchmarking Real-world Threats of Computer-use Agent

Mind the Gap: Comparing Model- vs Agentic-Level Red Teaming with Action-Graph Observability on GPT-OSS-20B

PEAR: Planner-Executor Agent Robustness Benchmark