
RL Is a Hammer and LLMs Are Nails: A Simple Reinforcement Learning Recipe for Strong Prompt Injection

Yuxin Wen 1,2, Arman Zharmagambetov 2, Ivan Evtimov 2, Narine Kokhlikyan 2, Tom Goldstein 1, Kamalika Chaudhuri 2, Chuan Guo 2

9 citations · 48 references

Published on arXiv: 2510.04885

Prompt Injection

OWASP LLM Top 10 (LLM01)

Key Finding

RL-Hammer achieves a 98% attack success rate (ASR) against GPT-4o and a 72% ASR against GPT-5 under the Instruction Hierarchy defense, demonstrating that defenses previously considered highly robust are fragile against strong automated attackers.

RL-Hammer

Novel technique introduced


Prompt injection poses a serious threat to the reliability and safety of LLM agents. Recent defenses against prompt injection, such as Instruction Hierarchy and SecAlign, have shown notable robustness against static attacks. However, to more thoroughly evaluate the robustness of these defenses, it is arguably necessary to employ strong attacks such as automated red-teaming. To this end, we introduce RL-Hammer, a simple recipe for training attacker models that automatically learn to perform strong prompt injections and jailbreaks via reinforcement learning. RL-Hammer requires no warm-up data and can be trained entirely from scratch. To achieve high ASRs against industrial-level models with defenses, we propose a set of practical techniques that enable highly effective, universal attacks. Using this pipeline, RL-Hammer reaches a 98% ASR against GPT-4o and a 72% ASR against GPT-5 with the Instruction Hierarchy defense. We further discuss the challenge of achieving high diversity in attacks, highlighting how attacker models tend to reward-hack diversity objectives. Finally, we show that RL-Hammer can evade multiple prompt injection detectors. We hope our work advances automatic red-teaming and motivates the development of stronger, more principled defenses. Code is available at https://github.com/facebookresearch/rl-injector.


Key Contributions

  • RL-Hammer: a GRPO-based recipe for training prompt injection attacker models entirely from scratch with no warm-up data, achieving 98% ASR against GPT-4o and 72% against GPT-5 with Instruction Hierarchy defense
  • A bag of tricks (KL regularization removal, joint easy/robust target training, format enforcement) that overcomes sparse reward problems when attacking defended commercial LLMs
  • Diversity and detectability analyses showing that attacker models reward-hack diversity metrics and can be trained to evade all four tested prompt injection detectors while preserving high ASR
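The GRPO-based recipe named above can be illustrated with a minimal sketch. This is an assumption-laden toy, not the paper's implementation: the reward terms, the format-penalty weight, and the function names are all hypothetical. It shows the two ingredients the contributions list mentions: a sparse success reward shaped by format enforcement, and GRPO's group-relative advantage normalization (note there is no KL term, matching the "KL regularization removal" trick).

```python
# Hedged sketch of a GRPO-style training signal for an attacker model.
# All names and weights here are illustrative assumptions.

def reward(attack_succeeded: bool, follows_format: bool) -> float:
    """Binary task reward plus an assumed format-enforcement penalty."""
    r = 1.0 if attack_succeeded else 0.0
    if not follows_format:  # e.g. the injection omits a required template
        r -= 1.0            # penalty weight is an assumption
    return r

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """GRPO scores each sampled attack relative to its own group:
    advantage = (r - group_mean) / group_std. No KL regularizer is added."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + 1e-8) for r in rewards]

# Example: a group of 4 sampled injections against one target prompt;
# two succeed with correct format, one fails, one also breaks the format.
rs = [reward(True, True), reward(True, True),
      reward(False, True), reward(False, False)]
advs = group_relative_advantages(rs)
```

Successful, well-formatted attacks get positive advantages and the malformed failure gets the most negative one, which is what pushes the attacker policy toward effective, format-compliant injections even when raw success rewards are sparse.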

🛡️ Threat Analysis


Details

Domains
nlp
Model Types
llm, transformer, rl
Threat Tags
black_box, inference_time
Datasets
InjecAgent, AgentDojo, WASP, AgentDAM
Applications
llm agents, agentic ai systems