Online Learning Defense against Iterative Jailbreak Attacks via Prompt Optimization
Masahiro Kaneko 1, Zeerak Talat 2, Timothy Baldwin 1
Published on arXiv
2510.17006
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
The proposed online learning defense significantly outperforms five existing defense methods against five iterative jailbreak methods across GPT-4, OLMo 2, and Llama 3, while also improving response quality on harmless tasks.
Past-Direction Gradient Damping (PDGD)
Novel technique introduced
Iterative jailbreak methods, which repeatedly rewrite prompts and feed them to large language models (LLMs) to induce harmful outputs, using the model's previous responses to guide each new iteration, have proven to be a highly effective attack strategy. Despite their effectiveness against LLMs and their safety mechanisms, existing defenses do not proactively disrupt this dynamic trial-and-error cycle. In this study, we propose a novel framework that dynamically updates its defense strategy through online learning in response to each new prompt from iterative jailbreak methods. Leveraging the distinctions between harmful jailbreak-generated prompts and typical harmless prompts, we introduce a reinforcement learning-based approach that optimizes prompts to ensure appropriate responses to harmless tasks while explicitly rejecting harmful prompts. Additionally, to curb overfitting to the narrow band of partial input rewrites explored during an attack, we introduce Past-Direction Gradient Damping (PDGD). Experiments on three LLMs show that our approach significantly outperforms five existing defense methods against five iterative jailbreak methods. Moreover, our results indicate that our prompt optimization strategy simultaneously enhances response quality on harmless tasks.
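The online defense loop described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `rewrite`, `respond`, `classify`, and `update` callables and the reward shaping are hypothetical placeholders standing in for the defense policy, the LLM, a harmfulness detector, and the RL update, respectively.

```python
def reward(prompt_is_harmful, response_refused, response_quality):
    # Illustrative RL reward: reward refusals on harmful prompts and
    # high-quality answers on harmless ones (names are hypothetical).
    if prompt_is_harmful:
        return 1.0 if response_refused else -1.0
    return -1.0 if response_refused else response_quality

def online_defense_loop(incoming_prompts, rewrite, respond, classify, update):
    """Sketch of the online learning defense: after every incoming prompt
    (possibly one step of an iterative jailbreak), the defense's
    prompt-rewriting policy is updated from the observed reward, so the
    attacker's trial-and-error loop never faces a static target."""
    for prompt in incoming_prompts:
        defended = rewrite(prompt)                     # current policy rewrites the prompt
        response, refused, quality = respond(defended) # query the protected LLM
        update(reward(classify(prompt), refused, quality))  # online policy update
        yield response
```

Because the update happens after every query, each new attack iteration is evaluated against a freshly adapted policy rather than the one it was optimized against.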
Key Contributions
- Online learning framework that dynamically updates its defense prompt-rewriting strategy after each iterative jailbreak query, proactively disrupting the attacker's trial-and-error optimization loop
- RL-based prompt optimization objective that simultaneously rejects harmful prompts and improves response quality on harmless tasks, challenging the assumed safety–utility trade-off
- Past-Direction Gradient Damping (PDGD) regularization that penalizes gradient updates aligned with past update directions to prevent overfitting to the narrow prompt-rewrite distribution explored during an attack
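The PDGD idea in the last contribution can be illustrated with a small numerical sketch. The formula below is an assumption for illustration, assuming the damping subtracts the component of the current gradient that aligns with a running (EMA) direction of past updates; the paper's exact formulation may differ.

```python
import numpy as np

def pdgd_step(grad, past_dir, lam=0.5, eps=1e-8):
    """Past-Direction Gradient Damping (illustrative sketch).

    Attenuates the component of the current gradient that points along
    the unit-normalized direction of past updates, so repeated updates
    toward the attacker's narrow rewrite distribution are damped.
    Hypothetical formula, not the paper's exact definition.
    """
    d = past_dir / (np.linalg.norm(past_dir) + eps)
    align = float(grad @ d)
    if align > 0:  # damp only when moving the same way as past updates
        grad = grad - lam * align * d
    return grad

def online_update(theta, grad, past_dir, lr=0.1, beta=0.9):
    """One online step with PDGD and an EMA of past update directions."""
    damped = pdgd_step(grad, past_dir)
    theta = theta - lr * damped
    past_dir = beta * past_dir + (1 - beta) * damped
    return theta, past_dir
```

A gradient aligned with the past direction is shrunk, while an orthogonal gradient (a genuinely new direction, e.g. from a harmless prompt) passes through unchanged.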