Learning to Inject: Automated Prompt Injection via Reinforcement Learning
Xin Chen, Jie Zhang, Robert Mullins
Published on arXiv
2602.05746
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
A 1.5B-parameter RL-trained attack generator successfully compromises frontier LLM agent systems, including GPT-5 Nano, Claude Sonnet 3.5, and Gemini 2.5 Flash, on AgentDojo, establishing a new, stronger baseline for automated prompt injection.
AutoInject
Novel technique introduced
Prompt injection is one of the most critical vulnerabilities in LLM agents, yet effective automated attacks remain largely unexplored from an optimization perspective. Existing methods depend heavily on human red-teamers and hand-crafted prompts, limiting their scalability and adaptability. We propose AutoInject, a reinforcement learning framework that generates universal, transferable adversarial suffixes while jointly optimizing for attack success and utility preservation on benign tasks. Our black-box method supports both query-based optimization and transfer attacks to unseen models and tasks. Using only a 1.5B-parameter adversarial suffix generator, we successfully compromise frontier systems including GPT-5 Nano, Claude Sonnet 3.5, and Gemini 2.5 Flash on the AgentDojo benchmark, establishing a stronger baseline for automated prompt injection research.
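The abstract's joint objective, balancing attack success against utility preservation on benign tasks, can be sketched as a scalar reward. The linear form, the `alpha` weight, and the function name below are illustrative assumptions; the paper does not specify this exact formulation.

```python
def joint_reward(attack_success: float, benign_utility: float,
                 alpha: float = 0.5) -> float:
    """Hypothetical scalar reward for the RL suffix generator.

    attack_success: fraction of injected tasks where the agent
        executed the attacker's goal (ASR), in [0, 1].
    benign_utility: fraction of benign tasks the agent still
        completes correctly with the suffix present, in [0, 1].
    alpha: assumed trade-off weight; not taken from the paper.
    """
    return alpha * attack_success + (1.0 - alpha) * benign_utility
```

Rewarding benign-task utility alongside attack success is what pushes the generator toward stealthy, targeted hijacking rather than suffixes that simply break the agent on every task.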
Key Contributions
- AutoInject: an RL framework using a compact 1.5B-parameter model to generate universal, transferable adversarial suffixes for prompt injection attacks
- Joint reward optimization balancing attack success rate and benign task utility preservation, enabling stealthy targeted hijacking rather than general model disruption
- Demonstrated black-box transfer attacks that compromise frontier LLM agents (GPT-5 Nano, Claude Sonnet 3.5, Gemini 2.5 Flash) on the AgentDojo benchmark, outperforming GCG, TAP, and random adaptive baselines
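The black-box, query-based side of the method can be illustrated with a minimal search loop that scores candidate suffixes only through queries to the target. This is a simplified stand-in: the actual method trains an RL suffix generator rather than searching over a fixed candidate pool, and `score` here is a hypothetical callable representing a query to the target agent.

```python
import random

def query_based_search(score, candidates, steps=100, seed=0):
    """Minimal black-box sketch: repeatedly query the target (via
    `score`, a hypothetical black-box callable) and keep the
    best-scoring candidate suffix seen so far."""
    rng = random.Random(seed)
    best = rng.choice(candidates)
    best_score = score(best)
    for _ in range(steps):
        cand = rng.choice(candidates)
        s = score(cand)
        if s > best_score:
            best, best_score = cand, s
    return best, best_score
```

In the transfer setting, by contrast, the suffix is optimized against accessible models and then applied to unseen models and tasks without any further queries.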