defense 2025

Retrieval-Augmented Defense: Adaptive and Controllable Jailbreak Prevention for Large Language Models

Guangyu Yang , Jinghong Chen , Jingbiao Mei , Weizhe Lin , Bill Byrne

University of Cambridge

0 citations

Published on arXiv

2508.16406

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

RAD substantially reduces effectiveness of PAP and PAIR jailbreak attacks while maintaining low false rejection rates for benign queries across controllable safety-utility operating points.

RAD (Retrieval-Augmented Defense)

Novel technique introduced

Large Language Models (LLMs) remain vulnerable to jailbreak attacks, which attempt to elicit harmful responses from LLMs. The evolving nature and diversity of these attacks pose many challenges for defense systems, including (1) adaptation to counter emerging attack strategies without costly retraining, and (2) control of the trade-off between safety and utility. To address these challenges, we propose Retrieval-Augmented Defense (RAD), a novel framework for jailbreak detection that incorporates a database of known attack examples into Retrieval-Augmented Generation, which is used to infer the underlying, malicious user query and jailbreak strategy used to attack the system. RAD enables training-free updates for newly discovered jailbreak strategies and provides a mechanism to balance safety and utility. Experiments on StrongREJECT show that RAD substantially reduces the effectiveness of strong jailbreak attacks such as PAP and PAIR while maintaining low rejection rates for benign queries. We propose a novel evaluation scheme and show that RAD achieves a robust safety-utility trade-off across a range of operating points in a controllable manner.

Key Contributions

Retrieval-Augmented Defense (RAD) framework that retrieves similar known jailbreak examples to infer malicious intent without model retraining
Training-free update mechanism enabling RAD to adapt to newly discovered jailbreak strategies by simply adding them to the database
Tunable decision threshold providing controllable safety-utility trade-off, evaluated via a novel operating-curve-based benchmark

🛡️ Threat Analysis

Details

Domains

nlp

Model Types

llmtransformer

Threat Tags

inference_timeblack_box

Datasets

StrongREJECT

Applications

llm safetyjailbreak detectionchatbot guardrails

Read PDF arXiv Code

Retrieval-Augmented Defense: Adaptive and Controllable Jailbreak Prevention for Large Language Models

Key Contributions

🛡️ Threat Analysis

Details

Similar Papers

PromptSleuth: Detecting Prompt Injection via Semantic Intent Invariance

Proactive defense against LLM Jailbreak

RedVisor: Reasoning-Aware Prompt Injection Defense via Zero-Copy KV Cache Reuse

FocusAgent: Simple Yet Effective Ways of Trimming the Large Context of Web Agents

Sentra-Guard: A Multilingual Human-AI Framework for Real-Time Defense Against Adversarial LLM Jailbreaks

From static to adaptive: immune memory-based jailbreak detection for large language models

Knowing When Not to Answer: Lightweight KB-Aligned OOD Detection for Safe RAG

Defend LLMs Through Self-Consciousness