attack 2026

TrailBlazer: History-Guided Reinforcement Learning for Black-Box LLM Jailbreaking

Sung-Hoon Yoon ¹,², Ruizhi Qian ³, Minda Zhao ¹, Weiyue Li ¹, Mengyu Wang ¹

¹ Harvard University

² Daegu Gyeongbuk Institute of Science and Technology

³ University of Southern California

0 citations · 56 references · arXiv (Cornell University)

Published on arXiv

2602.06440

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

Achieves state-of-the-art jailbreak performance on AdvBench and HarmBench while requiring significantly fewer queries than prior RL-based jailbreak methods.

TrailBlazer

Novel technique introduced

Large Language Models (LLMs) have become integral to many domains, making their safety a critical priority. Prior jailbreaking research has explored diverse approaches, including prompt optimization, automated red teaming, obfuscation, and reinforcement learning (RL)-based methods. However, most existing techniques fail to effectively leverage vulnerabilities revealed in earlier interaction turns, resulting in inefficient and unstable attacks. Since jailbreaking involves sequential interactions in which each response influences future actions, reinforcement learning provides a natural framework for this problem. Motivated by this, we propose a history-aware RL-based jailbreak framework that analyzes and reweights vulnerability signals from prior steps to guide future decisions. We show that incorporating historical information alone improves jailbreak success rates. Building on this insight, we introduce an attention-based reweighting mechanism that highlights critical vulnerabilities within the interaction history, enabling more efficient exploration with fewer queries. Extensive experiments on AdvBench and HarmBench demonstrate that our method achieves state-of-the-art jailbreak performance while significantly improving query efficiency. These results underscore the importance of historical vulnerability signals in reinforcement learning-driven jailbreak strategies and offer a principled pathway for advancing adversarial research on LLM safeguards.

Key Contributions

History-aware RL jailbreak framework that accumulates and reweights vulnerability signals across sequential interaction turns
Attention-based reweighting mechanism that highlights the most critical vulnerabilities in interaction history to guide future queries
State-of-the-art jailbreak success rates on AdvBench and HarmBench with significantly improved query efficiency over prior RL-based methods
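
The listing does not say what form the attention-based reweighting takes. As a rough illustration of the general idea only, the sketch below applies generic scaled dot-product attention to a list of per-turn feature vectors and returns one weight per past turn plus a weighted summary. The function names, the 2-dimensional features, and the choice of dot-product scoring are all assumptions made for illustration; this is not the authors' implementation.

```python
import math
from typing import List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def reweight_history(
    query: Sequence[float],
    history: Sequence[Sequence[float]],
) -> Tuple[List[float], List[float]]:
    """Scaled dot-product attention pooling over past-turn feature vectors.

    Returns (weights, summary): one weight per past turn, and the
    weighted average of the history vectors.
    """
    if not history:
        raise ValueError("history must contain at least one turn")
    dim = len(query)
    scale = math.sqrt(dim)
    # Score each past turn by its similarity to the current query vector.
    scores = [sum(q * h for q, h in zip(query, turn)) / scale for turn in history]
    weights = softmax(scores)
    # Pool the history into a single vector using the attention weights.
    summary = [
        sum(w * turn[i] for w, turn in zip(weights, history)) for i in range(dim)
    ]
    return weights, summary


# Hypothetical 2-dimensional features for three past turns (illustrative only).
history = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [2.0, 0.0]
weights, summary = reweight_history(query, history)
```

In this toy input, the first and third turns align equally with the query and so receive equal weight, while the second turn is down-weighted. A learned variant would typically replace the raw dot product with trained projections.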

🛡️ Threat Analysis

Details

Domains

nlp

Model Types

llm

Threat Tags

black_box, inference_time, targeted

Datasets

AdvBench, HarmBench

Applications

llm safety mechanisms, chatbot safeguards


