attack 2026

Learning to Inject: Automated Prompt Injection via Reinforcement Learning

Xin Chen , Jie Zhang , Robert Mullins

0 citations · 28 references · arXiv (Cornell University)

Published on arXiv

2602.05746

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

A 1.5B-parameter RL-trained attack generator compromises frontier LLM agent systems, including GPT 5 Nano, Claude Sonnet 3.5, and Gemini 2.5 Flash, on AgentDojo, establishing a new, stronger baseline for automated prompt injection.

AutoInject

Novel technique introduced


Prompt injection is one of the most critical vulnerabilities in LLM agents; yet, effective automated attacks remain largely unexplored from an optimization perspective. Existing methods heavily depend on human red-teamers and hand-crafted prompts, limiting their scalability and adaptability. We propose AutoInject, a reinforcement learning framework that generates universal, transferable adversarial suffixes while jointly optimizing for attack success and utility preservation on benign tasks. Our black-box method supports both query-based optimization and transfer attacks to unseen models and tasks. Using only a 1.5B parameter adversarial suffix generator, we successfully compromise frontier systems including GPT 5 Nano, Claude Sonnet 3.5, and Gemini 2.5 Flash on the AgentDojo benchmark, establishing a stronger baseline for automated prompt injection research.


Key Contributions

  • AutoInject: an RL framework using a compact 1.5B parameter model to generate universal, transferable adversarial suffixes for prompt injection attacks
  • Joint reward optimization balancing attack success rate and benign task utility preservation, enabling stealthy targeted hijacking rather than general model disruption
  • Demonstrated black-box transfer attacks that compromise frontier LLM agents (GPT 5 Nano, Claude Sonnet 3.5, Gemini 2.5 Flash) on the AgentDojo benchmark, outperforming GCG, TAP, and random adaptive baselines
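
The joint reward in the second contribution can be illustrated with a minimal sketch. The summary does not give the paper's actual reward, so everything here is an assumption for illustration: the weighted-sum form, the `utility_weight` parameter, and the `EpisodeOutcome` fields are hypothetical names, not AutoInject's API. The idea shown is only that a suffix scores highest when the agent performs the injected task and still completes the benign user task.

```python
from dataclasses import dataclass


@dataclass
class EpisodeOutcome:
    """Outcome of one agent episode with an injected suffix (hypothetical type)."""
    attacker_goal_met: bool    # the agent carried out the injected task
    user_task_completed: bool  # the agent still finished the benign user task


def joint_reward(outcome: EpisodeOutcome, utility_weight: float = 0.5) -> float:
    """Weighted sum of attack success and benign utility, in [0, 1].

    Assumed form only: the paper may combine the two signals differently.
    """
    if not 0.0 <= utility_weight <= 1.0:
        raise ValueError("utility_weight must be in [0, 1]")
    attack = 1.0 if outcome.attacker_goal_met else 0.0
    utility = 1.0 if outcome.user_task_completed else 0.0
    return (1.0 - utility_weight) * attack + utility_weight * utility


def mean_reward(outcomes: list[EpisodeOutcome], utility_weight: float = 0.5) -> float:
    """Average joint reward over a batch of episodes (0.0 for an empty batch)."""
    if not outcomes:
        return 0.0
    return sum(joint_reward(o, utility_weight) for o in outcomes) / len(outcomes)
```

Under this assumed form, a suffix that hijacks the agent but breaks the user task earns only part of the reward, which is one way to favor stealthy targeted hijacking over general disruption.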


Details

Domains
nlp
Model Types
llm
Threat Tags
black_box, inference_time, targeted
Datasets
AgentDojo
Applications
llm agents, autonomous ai systems