AdaptiveGuard: Towards Adaptive Runtime Safety for LLM-Powered Software
Rui Yang 1, Michael Fu 2, Chakkrit Tantithamthavorn 1, Chetan Arora 1, Gunel Gulmammadova 3, Joey Chua 3
Published on arXiv (arXiv:2509.16861)
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
AdaptiveGuard achieves 96% OOD detection accuracy and reaches a 100% defense success rate within a median of 2 continual learning update steps, outperforming LlamaGuard, which requires 4 steps, while retaining 85% F1 on in-distribution data.
AdaptiveGuard
Novel technique introduced
Guardrails are critical for the safe deployment of Large Language Model (LLM)-powered software. Unlike traditional rule-based systems with limited, predefined input-output spaces that inherently constrain unsafe behavior, LLMs enable open-ended, intelligent interactions--opening the door to jailbreak attacks through user inputs. Guardrails serve as a protective layer, filtering unsafe prompts before they reach the LLM. However, prior research shows that jailbreak attacks can still succeed over 70% of the time, even against advanced models like GPT-4o. While guardrails such as LlamaGuard report up to 95% accuracy, our preliminary analysis shows their performance can drop sharply--to as low as 12%--when confronted with unseen attacks. This highlights a growing software engineering challenge: how can we build a post-deployment guardrail that adapts dynamically to emerging threats? To address this, we propose AdaptiveGuard, an adaptive guardrail that detects novel jailbreak attacks as out-of-distribution (OOD) inputs and learns to defend against them through a continual learning framework. In our empirical evaluation, AdaptiveGuard achieves 96% OOD detection accuracy, adapts to new attacks in just two update steps, and retains over 85% F1-score on in-distribution data post-adaptation, outperforming other baselines. These results demonstrate that AdaptiveGuard is a guardrail capable of evolving in response to emerging jailbreak strategies post-deployment. We release AdaptiveGuard and the studied datasets at https://github.com/awsm-research/AdaptiveGuard to support further research.
Key Contributions
- Demonstrates that static guardrails like LlamaGuard degrade to as low as 12% Defense Success Rate against unseen jailbreak attacks, motivating adaptive defenses
- Proposes AdaptiveGuard, an OOD-aware continual learning framework that detects novel jailbreaks as out-of-distribution inputs and incrementally updates to defend against them
- Empirical evaluation showing 96% OOD detection TPR, 100% DSR within a median of 2 update steps, and retention of ≥85% F1-score on in-distribution data post-adaptation
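The two-stage idea behind the contributions above can be sketched in a toy form: flag prompts whose classifier confidence is low as OOD (a maximum-softmax-probability-style score), then take a few gradient steps on the flagged attack while replaying in-distribution examples to limit forgetting. This is a minimal illustration under our own assumptions (class names, thresholds, and the linear classifier are hypothetical), not the paper's actual architecture:

```python
# Hedged sketch of an OOD-aware, continually updated guardrail.
# Assumptions (not from the paper): a linear softmax classifier over
# fixed prompt embeddings, a max-probability OOD threshold, and a
# small replay buffer for retention.
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

class AdaptiveGuardSketch:
    def __init__(self, dim, n_classes=2, ood_threshold=0.75):
        self.W = rng.normal(0.0, 0.1, (dim, n_classes))
        self.threshold = ood_threshold  # max-prob below this => OOD
        self.replay = []                # in-distribution (x, label) pairs

    def predict(self, x):
        p = softmax(x @ self.W)
        return int(p.argmax()), float(p.max())

    def is_ood(self, x):
        # Low confidence on a prompt suggests it lies outside the
        # training distribution, i.e. a potentially novel attack.
        _, conf = self.predict(x)
        return conf < self.threshold

    def update(self, x, label, lr=0.5, max_steps=10):
        # Continual-learning step: fine-tune on the flagged attack,
        # mixed with replayed in-distribution examples, until the
        # attack is classified correctly (blocked).
        batch = [(x, label)] + self.replay[-8:]
        for step in range(1, max_steps + 1):
            for xi, yi in batch:
                p = softmax(xi @ self.W)
                onehot = np.eye(p.shape[-1])[yi]
                self.W -= lr * np.outer(xi, p - onehot)  # CE gradient
            pred, _ = self.predict(x)
            if pred == label:
                return step  # number of update steps needed
        return max_steps
```

A usage pass would flag a novel attack embedding as OOD, call `update` with the unsafe label, and then re-check that both the attack and the replayed in-distribution prompt are classified as intended. With near-orthogonal toy embeddings, the attack is typically blocked within one or two update steps, loosely mirroring the paper's "median of 2 steps" finding.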