defense 2025

DecipherGuard: Understanding and Deciphering Jailbreak Prompts for a Safer Deployment of Intelligent Software Systems

Rui Yang ¹, Michael Fu ², Chakkrit Tantithamthavorn ¹, Chetan Arora ¹, Gunel Gulmammadova ³, Joey Chua ³

¹ Monash University

² The University of Melbourne

³ Transurban

0 citations

Published on arXiv

2509.16870

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

DecipherGuard improves Defense Success Rate by 36%–65% and Overall Guardrail Performance by 20%–50% over LlamaGuard and two other state-of-the-art runtime guardrails.

DecipherGuard

Novel technique introduced

Intelligent software systems powered by Large Language Models (LLMs) are increasingly deployed in critical sectors, raising concerns about their safety at runtime. An industry-academic collaboration on deploying an LLM-powered virtual customer assistant surfaced a critical software engineering challenge: how can LLM-powered software systems be deployed more safely at runtime? Although LlamaGuard, the current state-of-the-art runtime guardrail, offers protection against unsafe inputs, our study reveals that its Defense Success Rate (DSR) drops by 24% under obfuscation- and template-based jailbreak attacks. In this paper, we propose DecipherGuard, a novel framework that integrates a deciphering layer to counter obfuscation-based prompts and a low-rank adaptation mechanism to strengthen the guardrail against template-based attacks. An empirical evaluation on over 22,000 prompts shows that DecipherGuard improves DSR by 36% to 65% and Overall Guardrail Performance (OGP) by 20% to 50% compared to LlamaGuard and two other runtime guardrails. These results highlight the effectiveness of DecipherGuard in defending LLM-powered software systems against jailbreak attacks at runtime.
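To make the idea of a deciphering layer concrete, here is a minimal Python sketch of decoding a prompt before it reaches a guardrail classifier. It is an illustration under stated assumptions, not the paper's implementation: the choice of Base64 and ROT13 as obfuscation schemes, the function names, and the keyword-based `toy_classifier` (a stand-in for a real guardrail model such as LlamaGuard) are all hypothetical.

```python
import base64
import binascii
import codecs
import re
from typing import Callable, List, Optional

# Hypothetical blocklist used only by the toy classifier below.
BLOCKLIST = ("build a bomb",)


def try_base64(text: str) -> Optional[str]:
    """Return the decoded text if `text` is valid Base64 of printable UTF-8, else None."""
    compact = "".join(text.split())
    if len(compact) < 8 or len(compact) % 4 != 0:
        return None
    if not re.fullmatch(r"[A-Za-z0-9+/]+={0,2}", compact):
        return None
    try:
        decoded = base64.b64decode(compact, validate=True).decode("utf-8")
    except (binascii.Error, UnicodeDecodeError):
        return None
    return decoded if decoded.isprintable() else None


def candidate_views(prompt: str) -> List[str]:
    """Return the raw prompt plus any deciphered variants of it."""
    views = [prompt]
    decoded = try_base64(prompt)
    if decoded is not None:
        views.append(decoded)
    # ROT13 is its own inverse, so always include the rotated text as a candidate.
    views.append(codecs.decode(prompt, "rot13"))
    return views


def toy_classifier(text: str) -> bool:
    """Stand-in for a guardrail model: flags text containing a blocklisted phrase."""
    lowered = text.lower()
    return any(term in lowered for term in BLOCKLIST)


def guarded_is_unsafe(prompt: str, classifier: Callable[[str], bool] = toy_classifier) -> bool:
    """Flag the prompt if the classifier flags the raw text or any deciphered view."""
    return any(classifier(view) for view in candidate_views(prompt))
```

In this sketch, a Base64-encoded harmful request slips past `toy_classifier` when checked directly but is caught by `guarded_is_unsafe`, because the decoded view is classified as well. Flagging on any view is a conservative design choice that may raise the false alarm rate on benign inputs.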

Key Contributions

A deciphering layer integrated into the guardrail pipeline to neutralize obfuscation-based jailbreak prompts before classification
Low-rank adaptation (LoRA) fine-tuning of LlamaGuard to improve robustness against template-based jailbreak attacks
The Overall Guardrail Performance (OGP) metric that jointly accounts for defense success rate and false alarm rate, enabling more realistic guardrail evaluation
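The two quantities that OGP is said to combine can be computed as in the Python sketch below. This is a hedged illustration: it assumes the common definitions (DSR as the fraction of attack prompts the guardrail flags, false alarm rate as the fraction of benign prompts it flags), and the `Outcome` type and function names are hypothetical. The OGP formula itself is not given in this summary, so it is deliberately not reproduced here.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Outcome:
    is_attack: bool  # ground truth: the prompt is a jailbreak attempt
    flagged: bool    # guardrail decision: the prompt was blocked


def _rate(flags: List[bool]) -> float:
    """Fraction of True values; raises if there is nothing to average."""
    if not flags:
        raise ValueError("no prompts of the required kind in the evaluation set")
    return sum(flags) / len(flags)


def defense_success_rate(outcomes: Iterable[Outcome]) -> float:
    """Assumed definition: share of attack prompts that the guardrail flagged."""
    return _rate([o.flagged for o in outcomes if o.is_attack])


def false_alarm_rate(outcomes: Iterable[Outcome]) -> float:
    """Assumed definition: share of benign prompts that the guardrail flagged."""
    return _rate([o.flagged for o in outcomes if not o.is_attack])
```

Reporting both rates side by side shows why a joint metric is useful: a guardrail that flags every prompt reaches a DSR of 1.0 but also a false alarm rate of 1.0.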

🛡️ Threat Analysis

Details

Domains

nlp

Model Types

llm, transformer

Threat Tags

inference_time, black_box

Datasets

22,000+ prompts spanning 10 jailbreak attack types

Applications

virtual customer assistants, llm-powered software systems

