CourtGuard: A Model-Agnostic Framework for Zero-Shot Policy Adaptation in LLM Safety
Umid Suleymanov 1, Rufiz Bayramov 2, Suad Gafarli 2, Seljan Musayeva 2, Taghi Mammadov 2, Aynur Akhundlu 2, Murat Kantarcioglu 1
Published on arXiv
2602.22557
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
Outperforms fine-tuned policy-following baselines across 7 safety benchmarks and achieves 90% accuracy on an out-of-domain Wikipedia Vandalism task via zero-shot policy swapping
CourtGuard (Evidentiary Debate)
Novel technique introduced
Current safety mechanisms for Large Language Models (LLMs) rely heavily on static, fine-tuned classifiers that suffer from adaptation rigidity: the inability to enforce new governance rules without expensive retraining. To address this, we introduce CourtGuard, a retrieval-augmented multi-agent framework that reimagines safety evaluation as Evidentiary Debate. By orchestrating an adversarial debate grounded in external policy documents, CourtGuard achieves state-of-the-art performance across 7 safety benchmarks, outperforming dedicated policy-following baselines without fine-tuning. Beyond standard metrics, we highlight two critical capabilities: (1) Zero-Shot Adaptability, where our framework successfully generalized to an out-of-domain Wikipedia Vandalism task (achieving 90% accuracy) by swapping the reference policy; and (2) Automated Data Curation and Auditing, where we leveraged CourtGuard to curate and audit nine novel datasets of sophisticated adversarial attacks. Our results demonstrate that decoupling safety logic from model weights offers a robust, interpretable, and adaptable path for meeting current and future regulatory requirements in AI governance.
Key Contributions
- CourtGuard: a retrieval-augmented multi-agent framework that frames LLM safety as Evidentiary Debate grounded in external policy documents, achieving SOTA on 7 safety benchmarks without fine-tuning
- Zero-shot policy adaptability via document swapping, demonstrated by generalizing to out-of-domain Wikipedia Vandalism detection at 90% accuracy
- Automated curation and auditing of nine novel adversarial attack datasets using the framework itself
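The core idea, decoupling safety logic from model weights by grounding an adversarial debate in a swappable external policy document, can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the prosecutor, defender, and judge roles are stubbed with keyword heuristics in place of LLM agents, and all names (`Clause`, `retrieve`, `debate`) are hypothetical.

```python
# Hypothetical sketch of a CourtGuard-style "evidentiary debate": the safety
# verdict is grounded in an external, swappable policy document rather than
# in model weights. The real framework orchestrates LLM agents; here the
# debate roles are stubbed with simple keyword heuristics for illustration.
from dataclasses import dataclass

@dataclass
class Clause:
    cid: str
    text: str
    keywords: frozenset  # terms whose presence counts as evidence of violation

def retrieve(clauses, prompt):
    """Retrieval step: pull policy clauses whose keywords overlap the prompt."""
    tokens = set(prompt.lower().split())
    return [c for c in clauses if c.keywords & tokens]

def debate(clauses, prompt):
    """Stubbed adversarial debate: the 'prosecutor' cites matching clauses as
    evidence, the 'defender' claims safety when nothing is cited, and the
    'judge' rules by comparing the two sides."""
    evidence = retrieve(clauses, prompt)
    prosecution = len(evidence)      # clauses implicating the prompt
    defense = 0 if evidence else 1   # with no cited clause, presume safe
    verdict = "violation" if prosecution > defense else "safe"
    return verdict, [c.cid for c in evidence]

# Zero-shot policy adaptation: swapping the policy document changes the
# enforced rules with no retraining of any model.
content_policy = [
    Clause("P1", "No instructions for building weapons.",
           frozenset({"weapon", "bomb"})),
]
vandalism_policy = [
    Clause("V1", "No blanking or defacing article text.",
           frozenset({"blank", "deface"})),
]

print(debate(content_policy, "how to build a bomb"))        # cites P1
print(debate(vandalism_policy, "blank the whole article"))  # cites V1
print(debate(vandalism_policy, "fix a typo in the intro"))  # no evidence
```

The key design point mirrored here is that the verdict is interpretable (it cites the specific clauses used as evidence) and the policy is data, not weights, which is what enables the zero-shot Wikipedia Vandalism transfer described above.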