defense 2025

AdaptiveGuard: Towards Adaptive Runtime Safety for LLM-Powered Software

Rui Yang 1, Michael Fu 2, Chakkrit Tantithamthavorn 1, Chetan Arora 1, Gunel Gulmammadova 3, Joey Chua 3

Published on arXiv

2509.16861

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

AdaptiveGuard achieves 96% OOD detection accuracy and reaches a 100% defense success rate within a median of 2 continual learning update steps, whereas LlamaGuard requires 4 steps. AdaptiveGuard also retains 85% F1 on in-distribution data.

AdaptiveGuard

Novel technique introduced


Guardrails are critical for the safe deployment of software powered by Large Language Models (LLMs). Unlike traditional rule-based systems with limited, predefined input-output spaces that inherently constrain unsafe behavior, LLMs enable open-ended, intelligent interactions, which opens the door to jailbreak attacks through user inputs. Guardrails serve as a protective layer, filtering unsafe prompts before they reach the LLM. However, prior research shows that jailbreak attacks can still succeed over 70% of the time, even against advanced models like GPT-4o. While guardrails such as LlamaGuard report up to 95% accuracy, our preliminary analysis shows their performance can drop sharply, to as low as 12%, when confronted with unseen attacks. This highlights a growing software engineering challenge: how can we build a post-deployment guardrail that adapts dynamically to emerging threats? To address this, we propose AdaptiveGuard, an adaptive guardrail that detects novel jailbreak attacks as out-of-distribution (OOD) inputs and learns to defend against them through a continual learning framework. Through empirical evaluation, AdaptiveGuard achieves 96% OOD detection accuracy, adapts to new attacks in just two update steps, and retains over 85% F1-score on in-distribution data post-adaptation, outperforming other baselines. These results demonstrate that AdaptiveGuard can evolve in response to emerging jailbreak strategies after deployment. We release AdaptiveGuard and the studied datasets at https://github.com/awsm-research/AdaptiveGuard to support further research.
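The detect-then-adapt loop described in the abstract can be illustrated with a minimal sketch. It is not the released AdaptiveGuard code: the energy-based OOD score, the two-logit classifier output, the threshold value, and every name below (`energy_score`, `AdaptiveGuardrailSketch`, `ood_threshold`, `update_buffer`) are assumptions chosen for illustration. Prompts whose score looks unfamiliar are blocked and queued for a later continual-learning update, which is left out here.

```python
import math
from dataclasses import dataclass, field


def energy_score(logits):
    """Negative log-sum-exp of the logits (numerically stable).

    Assumption: a higher value means the classifier is less familiar
    with the input, so it is treated as more likely out-of-distribution.
    """
    m = max(logits)
    return -(m + math.log(sum(math.exp(x - m) for x in logits)))


@dataclass
class AdaptiveGuardrailSketch:
    # Hypothetical threshold; in practice it would be tuned on held-out data.
    ood_threshold: float
    # Prompts flagged as OOD, kept for a later model update (not shown).
    update_buffer: list = field(default_factory=list)

    def screen(self, prompt, logits):
        """Decide what to do with a prompt.

        `logits` is assumed to be [safe_logit, unsafe_logit] produced by
        some guardrail classifier that is outside this sketch.
        """
        if energy_score(logits) > self.ood_threshold:
            self.update_buffer.append(prompt)
            return "block_ood"
        return "block" if logits[1] > logits[0] else "allow"


guard = AdaptiveGuardrailSketch(ood_threshold=-2.0)
print(guard.screen("What is the capital of France?", [6.0, -6.0]))
print(guard.screen("A known jailbreak template", [-5.0, 5.0]))
print(guard.screen("An unfamiliar attack style", [0.1, -0.1]))
print(len(guard.update_buffer))
```

Blocking OOD prompts by default is one possible design choice; a deployment could instead route them to human review before they are used for an update.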


Key Contributions

  • Demonstrates that static guardrails like LlamaGuard degrade to a Defense Success Rate (DSR) as low as 12% against unseen jailbreak attacks, motivating adaptive defenses
  • Proposes AdaptiveGuard, an OOD-aware continual learning framework that detects novel jailbreaks as out-of-distribution inputs and incrementally updates to defend against them
  • Reports an empirical evaluation showing 96% OOD detection TPR, 100% DSR within a median of 2 update steps, and retention of ≥85% F1-score on in-distribution data post-adaptation
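The metrics named in the contributions above can be sketched as follows. The definitions are assumptions, since this page does not spell them out: DSR is taken to be the fraction of jailbreak prompts the guardrail blocks, and "update steps" is taken to be the first continual-learning step at which DSR reaches 100%. All function names are hypothetical, and the numbers in the usage lines are made up for illustration, not results from the paper.

```python
from statistics import median


def defense_success_rate(blocked_flags):
    """Fraction of attack prompts that were blocked (assumed DSR definition)."""
    return sum(blocked_flags) / len(blocked_flags)


def steps_to_full_defense(dsr_per_step):
    """1-based index of the first update step with DSR of 1.0, else None."""
    for step, dsr in enumerate(dsr_per_step, start=1):
        if dsr >= 1.0:
            return step
    return None


def f1_score(tp, fp, fn):
    """Standard F1 computed from true positives, false positives, false negatives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Illustrative, made-up DSR trajectories for three hypothetical attack types.
trajectories = [[0.4, 1.0], [1.0], [0.2, 0.7, 1.0]]
steps = [steps_to_full_defense(t) for t in trajectories]
print(median(steps))
print(defense_success_rate([True, False, True, True]))
print(f1_score(tp=8, fp=2, fn=2))
```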

Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
black_box, inference_time
Datasets
AIM, DAN, Self Cipher, Deep Inception, SmartGPT, Code Chameleon jailbreak datasets
Applications
llm-powered software, conversational ai, chatbots