
Active Attacks: Red-teaming LLMs via Adaptive Environments

Taeyoung Yun 1,2, Pierre-Luc St-Charles 2,3, Jinkyoo Park 1,4, Yoshua Bengio 2,3,5, Minsu Kim 1,2

1 citation · 42 references · arXiv


Published on arXiv · 2509.21947

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

Active Attacks improves cross-attack success rates from 0.07% (GFlowNets, the previous state of the art) to 31.28%, a relative gain of more than 400×, while requiring only 6% additional computation over the baseline.

Active Attacks

Novel technique introduced


We address the challenge of generating diverse attack prompts for large language models (LLMs) that elicit harmful behaviors (e.g., insults, sexual content) and are used for safety fine-tuning. Rather than relying on manual prompt engineering, attacker LLMs can be trained with reinforcement learning (RL) to automatically generate such prompts using only a toxicity classifier as a reward. However, capturing a wide range of harmful behaviors is a significant challenge that requires explicit diversity objectives. Existing diversity-seeking RL methods often collapse to limited modes: once high-reward prompts are found, exploration of new regions is discouraged. Inspired by the active learning paradigm that encourages adaptive exploration, we introduce *Active Attacks*, a novel RL-based red-teaming algorithm that adapts its attacks as the victim evolves. By periodically safety fine-tuning the victim LLM with collected attack prompts, rewards in exploited regions diminish, which forces the attacker to seek unexplored vulnerabilities. This process naturally induces an easy-to-hard exploration curriculum, where the attacker progresses beyond easy modes toward increasingly difficult ones. As a result, Active Attacks uncovers a wide range of local attack modes step by step, and their combination achieves wide coverage of the multi-mode distribution. Active Attacks, a simple plug-and-play module that seamlessly integrates into existing RL objectives, unexpectedly outperformed prior RL-based methods -- including GFlowNets, PPO, and REINFORCE -- by improving cross-attack success rates against GFlowNets, the previous state-of-the-art, from 0.07% to 31.28% (a relative gain greater than 400×) with only a 6% increase in computation. Our code is publicly available at [github.com/dbsxodud-11/active_attacks](https://github.com/dbsxodud-11/active_attacks).
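The alternating loop described in the abstract (attacker RL steps interleaved with periodic victim safety fine-tuning) can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: all names (`attacker`, `victim`, `toxicity_reward`, `rl_update`, `safety_finetune`) are hypothetical placeholders for the components the paper describes.

```python
# Hypothetical sketch of the Active Attacks loop. All component names are
# placeholders; the actual training code is in the authors' repository.

def active_attacks(attacker, victim, toxicity_reward,
                   rl_update, safety_finetune,
                   n_rounds=10, steps_per_round=100):
    """Alternate attacker RL training with periodic victim safety fine-tuning."""
    all_prompts = []
    for _ in range(n_rounds):
        round_prompts = []
        for _ in range(steps_per_round):
            prompts = attacker.sample()                  # candidate attack prompts
            responses = [victim.respond(p) for p in prompts]
            rewards = [toxicity_reward(r) for r in responses]
            rl_update(attacker, prompts, rewards)        # e.g. GFlowNet/PPO/REINFORCE step
            round_prompts.extend(prompts)
        # Safety fine-tuning the victim on the collected attacks flattens the
        # reward in exploited regions, pushing the attacker toward unexplored
        # vulnerabilities (the easy-to-hard curriculum).
        safety_finetune(victim, round_prompts)
        all_prompts.extend(round_prompts)
    return all_prompts
```

Because the module only wraps an existing RL update and adds a periodic fine-tuning call, it is plug-and-play with respect to the inner RL objective, matching the paper's claim of seamless integration with GFlowNets, PPO, and REINFORCE.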


Key Contributions

  • Active Attacks: an RL-based red-teaming framework that periodically safety fine-tunes the victim LLM to flatten exploited reward regions, forcing the attacker toward undiscovered vulnerabilities
  • Easy-to-hard exploration curriculum that naturally emerges from adaptive victim updates, enabling diverse multi-mode coverage of harmful behaviors
  • Plug-and-play integration with existing RL objectives (GFlowNets, PPO, REINFORCE), achieving a more than 400× relative improvement in cross-attack success rate over the previous GFlowNets state of the art with only 6% more computation

🛡️ Threat Analysis


Details

Domains
nlp
Model Types
llm, rl
Threat Tags
black_box, inference_time
Datasets
RealToxicityPrompts
Applications
llm safety fine-tuning, automated red-teaming, chatbots