tool 2025

GUARD: Guideline Upholding Test through Adaptive Role-play and Jailbreak Diagnostics for LLMs

Haibo Jin ¹, Ruoxi Chen ², Peiyan Zhang ³, Andy Zhou ¹, Haohan Wang ¹

¹ University of Illinois at Urbana-Champaign

² Starc Institute

³ Hong Kong University of Science and Technology

0 citations

Published on arXiv

2508.20325

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

GUARD-JD successfully identifies jailbreak scenarios that bypass safety mechanisms across Vicuna-13B, LongChat-7B, Llama-series, GPT-3.5/4/4o, and Claude-3.7, and transfers to VLMs (MiniGPT-v2, Gemini-1.5)

GUARD / GUARD-JD

Novel technique introduced

As Large Language Models become increasingly integral to various domains, their potential to generate harmful responses has prompted significant societal and regulatory concerns. In response, governments have issued ethics guidelines to promote the development of trustworthy AI. However, these guidelines are typically high-level demands for developers and testers, leaving a gap in translating them into actionable testing questions to verify LLM compliance. To address this challenge, we introduce GUARD (Guideline Upholding Test through Adaptive Role-play and Jailbreak Diagnostics), a testing method designed to operationalize guidelines into specific guideline-violating questions that assess LLM adherence. GUARD automatically generates guideline-violating questions from government-issued guidelines and tests whether the model's responses comply with those guidelines. When responses directly violate guidelines, GUARD reports inconsistencies. For responses that do not directly violate guidelines, GUARD applies jailbreak diagnostics, named GUARD-JD, which creates scenarios that provoke unethical or guideline-violating responses, identifying potential scenarios that could bypass built-in safety mechanisms. The method culminates in a compliance report that delineates the extent of adherence and highlights any violations. We have empirically validated the effectiveness of GUARD on eight LLMs (Vicuna-13B, LongChat-7B, Llama2-7B, Llama-3-8B, GPT-3.5, GPT-4, GPT-4o, and Claude-3.7) by testing compliance under three government-issued guidelines and conducting jailbreak diagnostics. Additionally, GUARD-JD can transfer jailbreak diagnostics to vision-language models, demonstrating its usage in promoting reliable LLM-based applications.
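The two-stage flow described in the abstract (direct compliance test first, jailbreak diagnostics only for questions the model handled safely, then a compliance report) can be sketched roughly as follows. This is a minimal illustration of the control flow only, not the authors' implementation: every name here (`run_guard`, `Finding`, `ComplianceReport`, and the four callables) is hypothetical, and the toy stubs stand in for the real question generator, target model, violation judge, and role-play scenario builder.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Finding:
    guideline: str
    question: str
    stage: str      # "direct" or "jailbreak"
    prompt: str     # the exact prompt that triggered the violation
    response: str


@dataclass
class ComplianceReport:
    total_questions: int = 0
    findings: List[Finding] = field(default_factory=list)

    @property
    def violation_rate(self) -> float:
        # Share of questions with a direct or jailbreak-induced violation.
        if self.total_questions == 0:
            return 0.0
        return len(self.findings) / self.total_questions


def run_guard(
    guidelines: List[str],
    generate_questions: Callable[[str], List[str]],  # hypothetical generator
    target_model: Callable[[str], str],              # hypothetical model under test
    violates: Callable[[str, str], bool],            # hypothetical judge
    build_scenarios: Callable[[str], List[str]],     # hypothetical scenario builder
) -> ComplianceReport:
    report = ComplianceReport()
    for guideline in guidelines:
        for question in generate_questions(guideline):
            report.total_questions += 1

            # Stage 1: ask the question directly.
            response = target_model(question)
            if violates(guideline, response):
                report.findings.append(
                    Finding(guideline, question, "direct", question, response))
                continue

            # Stage 2: the direct answer was compliant, so try wrapped scenarios.
            for scenario in build_scenarios(question):
                response = target_model(scenario)
                if violates(guideline, response):
                    report.findings.append(
                        Finding(guideline, question, "jailbreak", scenario, response))
                    break
    return report


# Toy stubs (assumptions for illustration only; no real model is called).
def toy_generate(guideline: str) -> List[str]:
    return [f"{guideline}: Q1", f"{guideline}: Q2", f"{guideline}: Q3"]


def toy_model(prompt: str) -> str:
    if prompt.endswith("Q1"):
        return "UNSAFE"
    if prompt.startswith("[scenario]") and prompt.endswith("Q2"):
        return "UNSAFE"
    return "REFUSED"


def toy_violates(guideline: str, response: str) -> bool:
    return response == "UNSAFE"


def toy_scenarios(question: str) -> List[str]:
    return [f"[scenario] {question}"]


report = run_guard(["G-A"], toy_generate, toy_model, toy_violates, toy_scenarios)
print(report.total_questions, [f.stage for f in report.findings])
```

With the toy stubs, one question fails the direct test, one is only caught in the scenario stage, and one stays compliant. Passing the components in as callables keeps the sketch independent of any particular model API or judging method, which the summary above does not specify.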

Key Contributions

Automated pipeline that translates government AI ethics guidelines into actionable guideline-violating test questions for LLMs
GUARD-JD: jailbreak diagnostics module using adaptive role-play scenarios to surface safety bypasses in LLMs and VLMs
Compliance reporting framework validated across 8 LLMs (including GPT-4o, Claude-3.7) and 2 VLMs under 3 government-issued AI ethics guidelines

🛡️ Threat Analysis

Details

Domains

nlp, multimodal

Model Types

llm, vlm

Threat Tags

black_box, inference_time

Datasets

EU Ethics Guidelines for Trustworthy AI, government-issued AI guidelines (3 total)

Applications

llm safety testing, ai compliance auditing, red-teaming

Similar Papers

MUSE: A Run-Centric Platform for Multimodal Unified Safety Evaluation of Large Language Models

AM$^3$Safety: Towards Data Efficient Alignment of Multi-modal Multi-turn Safety for MLLMs

Alignment Drift in Multimodal LLMs: A Two-Phase, Longitudinal Evaluation of Harm Across Eight Model Releases

Clouding the Mirror: Stealthy Prompt Injection Attacks Targeting LLM-based Phishing Detection

Multi-turn Jailbreaking Attack in Multi-Modal Large Language Models

ICON: Indirect Prompt Injection Defense for Agents based on Inference-Time Correction

Red Teaming Multimodal Language Models: Evaluating Harm Across Prompt Modalities and Models

Gaming the Judge: Unfaithful Chain-of-Thought Can Undermine Agent Evaluation