α

Published on arXiv

2603.15714

Prompt Injection

OWASP LLM Top 10 — LLM01

Excessive Agency

OWASP LLM Top 10 — LLM08

Key Finding

All 13 frontier models vulnerable to indirect prompt injection with success rates from 0.5% (Claude Opus 4.5) to 8.5% (Gemini 2.5 Pro); weak correlation between capability and robustness


LLM based agents are increasingly deployed in high stakes settings where they process external data sources such as emails, documents, and code repositories. This creates exposure to indirect prompt injection attacks, where adversarial instructions embedded in external content manipulate agent behavior without user awareness. A critical but underexplored dimension of this threat is concealment: since users tend to observe only an agent's final response, an attack can conceal its existence by presenting no clue of compromise in the final user facing response while successfully executing harmful actions. This leaves users unaware of the manipulation and likely to accept harmful outcomes as legitimate. We present findings from a large scale public red teaming competition evaluating this dual objective across three agent settings: tool calling, coding, and computer use. The competition attracted 464 participants who submitted 272000 attack attempts against 13 frontier models, yielding 8648 successful attacks across 41 scenarios. All models proved vulnerable, with attack success rates ranging from 0.5% (Claude Opus 4.5) to 8.5% (Gemini 2.5 Pro). We identify universal attack strategies that transfer across 21 of 41 behaviors and multiple model families, suggesting fundamental weaknesses in instruction following architectures. Capability and robustness showed weak correlation, with Gemini 2.5 Pro exhibiting both high capability and high vulnerability. To address benchmark saturation and obsoleteness, we will endeavor to deliver quarterly updates through continued red teaming competitions. We open source the competition environment for use in evaluations, along with 95 successful attacks against Qwen that did not transfer to any closed source model. We share model-specific attack data with respective frontier labs and the full dataset with the UK AISI and US CAISI to support robustness research.


Key Contributions

  • Large-scale public red teaming competition with 464 participants, 272,000 attack attempts, and 8,648 successful attacks across 13 frontier models and 41 scenarios
  • Identification of universal attack strategies transferring across 21 of 41 behaviors and multiple model families, suggesting fundamental weaknesses in instruction-following architectures
  • Open-sourced competition environment and 95 successful attacks against Qwen; shared model-specific data with frontier labs and full dataset with UK AISI and US CAISI

🛡️ Threat Analysis


Details

Domains
nlpmultimodal
Model Types
llm
Threat Tags
black_boxinference_timetargeted
Datasets
41 agent scenarios272,000 attack attempts8,648 successful attacks
Applications
llm agentstool callingcode generationcomputer use automation