Learning to Inject: Automated Prompt Injection via Reinforcement Learning
Xin Chen, Jie Zhang, Robert Mullins
Published on arXiv
2602.05746
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
A 1.5B-parameter RL-trained attack generator successfully compromises frontier LLM agent systems, including GPT-5 Nano, Claude Sonnet 3.5, and Gemini 2.5 Flash, on AgentDojo, establishing a new, stronger baseline for automated prompt injection.
AutoInject
Novel technique introduced
Prompt injection is one of the most critical vulnerabilities in LLM agents, yet effective automated attacks remain largely unexplored from an optimization perspective. Existing methods depend heavily on human red-teamers and hand-crafted prompts, limiting their scalability and adaptability. We propose AutoInject, a reinforcement learning framework that generates universal, transferable adversarial suffixes while jointly optimizing for attack success and utility preservation on benign tasks. Our black-box method supports both query-based optimization and transfer attacks to unseen models and tasks. Using only a 1.5B-parameter adversarial suffix generator, we successfully compromise frontier systems including GPT-5 Nano, Claude Sonnet 3.5, and Gemini 2.5 Flash on the AgentDojo benchmark, establishing a stronger baseline for automated prompt injection research.
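The abstract's joint objective, balancing attack success against utility preservation on benign tasks, can be sketched as a scalar reward. The linear form, the `alpha` weight, and the function name below are illustrative assumptions; the paper does not specify this exact formulation.

```python
def joint_reward(attack_success: float, benign_utility: float,
                 alpha: float = 0.5) -> float:
    """Hypothetical scalar reward for the RL suffix generator.

    attack_success: fraction of injected tasks where the agent
        executed the attacker's goal (ASR), in [0, 1].
    benign_utility: fraction of benign tasks the agent still
        completes correctly with the suffix present, in [0, 1].
    alpha: assumed trade-off weight; not taken from the paper.
    """
    return alpha * attack_success + (1.0 - alpha) * benign_utility
```

Rewarding benign-task utility alongside attack success is what pushes the generator toward stealthy, targeted hijacking rather than suffixes that simply break the agent on every task.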
Key Contributions
- AutoInject: an RL framework using a compact 1.5B-parameter model to generate universal, transferable adversarial suffixes for prompt injection attacks
- Joint reward optimization balancing attack success rate and benign task utility preservation, enabling stealthy targeted hijacking rather than general model disruption
- Demonstrated black-box transfer attacks that compromise frontier LLM agents (GPT-5 Nano, Claude Sonnet 3.5, Gemini 2.5 Flash) on the AgentDojo benchmark, outperforming GCG, TAP, and random adaptive baselines
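The black-box, query-based side of the method can be illustrated with a minimal search loop that scores candidate suffixes only through queries to the target. This is a simplified stand-in: the actual method trains an RL suffix generator rather than searching over a fixed candidate pool, and `score` here is a hypothetical callable representing a query to the target agent.

```python
import random

def query_based_search(score, candidates, steps=100, seed=0):
    """Minimal black-box sketch: repeatedly query the target (via
    `score`, a hypothetical black-box callable) and keep the
    best-scoring candidate suffix seen so far."""
    rng = random.Random(seed)
    best = rng.choice(candidates)
    best_score = score(best)
    for _ in range(steps):
        cand = rng.choice(candidates)
        s = score(cand)
        if s > best_score:
            best, best_score = cand, s
    return best, best_score
```

In the transfer setting, by contrast, the suffix is optimized against accessible models and then applied to unseen models and tasks without any further queries.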