
RL Is a Hammer and LLMs Are Nails: A Simple Reinforcement Learning Recipe for Strong Prompt Injection

Yuxin Wen 1,2, Arman Zharmagambetov 2, Ivan Evtimov 2, Narine Kokhlikyan 2, Tom Goldstein 1, Kamalika Chaudhuri 2, Chuan Guo 2

9 citations · 48 references

Published on arXiv: 2510.04885

Prompt Injection

OWASP LLM Top 10 (LLM01)

Key Finding

RL-Hammer achieves a 98% attack success rate (ASR) against GPT-4o and a 72% ASR against GPT-5 under the Instruction Hierarchy defense, demonstrating that defenses previously considered highly robust are fragile against strong automated attackers.

RL-Hammer

Novel technique introduced


Prompt injection poses a serious threat to the reliability and safety of LLM agents. Recent defenses against prompt injection, such as Instruction Hierarchy and SecAlign, have shown notable robustness against static attacks. However, to more thoroughly evaluate the robustness of these defenses, it is arguably necessary to employ strong attacks such as automated red-teaming. To this end, we introduce RL-Hammer, a simple recipe for training attacker models that automatically learn to perform strong prompt injections and jailbreaks via reinforcement learning. RL-Hammer requires no warm-up data and can be trained entirely from scratch. To achieve high ASRs against industrial-level models with defenses, we propose a set of practical techniques that enable highly effective, universal attacks. Using this pipeline, RL-Hammer reaches a 98% ASR against GPT-4o and a 72% ASR against GPT-5 with the Instruction Hierarchy defense. We further discuss the challenge of achieving high diversity in attacks, highlighting how attacker models tend to reward-hack diversity objectives. Finally, we show that RL-Hammer can evade multiple prompt injection detectors. We hope our work advances automatic red-teaming and motivates the development of stronger, more principled defenses. Code is available at https://github.com/facebookresearch/rl-injector.


Key Contributions

  • RL-Hammer: a GRPO-based recipe for training prompt injection attacker models entirely from scratch with no warm-up data, achieving 98% ASR against GPT-4o and 72% against GPT-5 with Instruction Hierarchy defense
  • A bag of tricks (KL regularization removal, joint easy/robust target training, format enforcement) that overcomes sparse reward problems when attacking defended commercial LLMs
  • Diversity and detectability analyses showing that attacker models reward-hack diversity metrics and can be trained to evade all four tested prompt injection detectors while preserving high ASR
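The GRPO-based recipe named above can be illustrated with a minimal sketch. This is an assumption-laden toy, not the paper's implementation: the reward terms, the format-penalty weight, and the function names are all hypothetical. It shows the two ingredients the contributions list mentions: a sparse success reward shaped by format enforcement, and GRPO's group-relative advantage normalization (note there is no KL term, matching the "KL regularization removal" trick).

```python
# Hedged sketch of a GRPO-style training signal for an attacker model.
# All names and weights here are illustrative assumptions.

def reward(attack_succeeded: bool, follows_format: bool) -> float:
    """Binary task reward plus an assumed format-enforcement penalty."""
    r = 1.0 if attack_succeeded else 0.0
    if not follows_format:  # e.g. the injection omits a required template
        r -= 1.0            # penalty weight is an assumption
    return r

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """GRPO scores each sampled attack relative to its own group:
    advantage = (r - group_mean) / group_std. No KL regularizer is added."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + 1e-8) for r in rewards]

# Example: a group of 4 sampled injections against one target prompt;
# two succeed with correct format, one fails, one also breaks the format.
rs = [reward(True, True), reward(True, True),
      reward(False, True), reward(False, False)]
advs = group_relative_advantages(rs)
```

Successful, well-formatted attacks get positive advantages and the malformed failure gets the most negative one, which is what pushes the attacker policy toward effective, format-compliant injections even when raw success rewards are sparse.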

🛡️ Threat Analysis


Details

Domains
nlp
Model Types
llm, transformer, rl
Threat Tags
black_box, inference_time
Datasets
InjecAgent, AgentDojo, WASP, AgentDAM
Applications
llm agents, agentic ai systems