defense 2026

ExpGuard: LLM Content Moderation in Specialized Domains

Minseok Choi ¹,², Dongjin Kim ¹,², Seungbin Yang ¹,², Subin Kim ², Youngjun Kwak ², Juyoung Oh ², Jaegul Choo ¹, Jungmin Son ²

¹ KAIST

² KakaoBank

0 citations

Published on arXiv

2603.02588

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

ExpGuard surpasses WildGuard by up to 8.9% in prompt classification and 15.3% in response classification on domain-specific adversarial content across financial, medical, and legal sectors.

ExpGuard

Novel technique introduced

With the growing deployment of large language models (LLMs) in real-world applications, establishing robust safety guardrails to moderate their inputs and outputs has become essential to ensure adherence to safety policies. Current guardrail models predominantly address general human-LLM interactions, rendering LLMs vulnerable to harmful and adversarial content within domain-specific contexts, particularly those rich in technical jargon and specialized concepts. To address this limitation, we introduce ExpGuard, a robust and specialized guardrail model designed to protect against harmful prompts and responses across financial, medical, and legal domains. In addition, we present ExpGuardMix, a meticulously curated dataset of 58,928 labeled prompts from these sectors, each paired with corresponding refusal and compliant responses. This dataset is divided into two subsets: ExpGuardTrain, for model training, and ExpGuardTest, a high-quality test set annotated by domain experts to evaluate model robustness against technical and domain-specific content. Comprehensive evaluations conducted on ExpGuardTest and eight established public benchmarks reveal that ExpGuard delivers competitive performance across the board while demonstrating exceptional resilience to domain-specific adversarial attacks, surpassing state-of-the-art models such as WildGuard by up to 8.9% in prompt classification and 15.3% in response classification. To encourage further research and development, we open-source our code, data, and model, enabling adaptation to additional domains and supporting the creation of increasingly robust guardrail models.

Key Contributions

ExpGuard: a domain-specialized guardrail model for harmful prompt/response classification in financial, medical, and legal domains
ExpGuardMix dataset: 58,928 labeled prompts with refusal/compliant responses across three specialized domains, including an expert-annotated test set
Demonstrated improvements over WildGuard of up to 8.9% in prompt classification and 15.3% in response classification on domain-specific adversarial content

🛡️ Threat Analysis

Details

Domains

nlp

Model Types

llm, transformer

Threat Tags

inference_time, black_box

Datasets

ExpGuardTest, ExpGuardTrain, WildGuard benchmark, eight public safety benchmarks

Applications

llm content moderation, financial domain safety, medical domain safety, legal domain safety

Similar Papers

Alignment-Weighted DPO: A principled reasoning approach to improve safety alignment

Jailbreaking Leaves a Trace: Understanding and Detecting Jailbreak Attacks from Internal Representations of Large Language Models

Securing AI Agents Against Prompt Injection Attacks

PromptSleuth: Detecting Prompt Injection via Semantic Intent Invariance

From static to adaptive: immune memory-based jailbreak detection for large language models

Knowing When Not to Answer: Lightweight KB-Aligned OOD Detection for Safe RAG

Defend LLMs Through Self-Consciousness

Prefix Probing: Lightweight Harmful Content Detection for Large Language Models