tool 2025

SGuard-v1: Safety Guardrail for Large Language Models

JoonHo Lee , HyeonMin Cho , Jaewoong Yun , Hyunjae Lee , JunKyu Lee , Juree Seok

Samsung SDS

1 citations · 26 references · arXiv

Published on arXiv

2511.12497

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

SGuard-v1 achieves state-of-the-art safety performance across public and proprietary benchmarks while remaining lightweight enough for real-time deployment on a 2B-parameter model supporting 12 languages.

SGuard-v1

Novel technique introduced

We present SGuard-v1, a lightweight safety guardrail for Large Language Models (LLMs), which comprises two specialized models to detect harmful content and screen adversarial prompts in human-AI conversational settings. The first component, ContentFilter, is trained to identify safety risks in LLM prompts and responses in accordance with the MLCommons hazard taxonomy, a comprehensive framework for trust and safety assessment of AI. The second component, JailbreakFilter, is trained with a carefully designed curriculum over integrated datasets and findings from prior work on adversarial prompting, covering 60 major attack types while mitigating false-unsafe classification. SGuard-v1 is built on the 2B-parameter Granite-3.3-2B-Instruct model that supports 12 languages. We curate approximately 1.4 million training instances from both collected and synthesized data and perform instruction tuning on the base model, distributing the curated data across the two component according to their designated functions. Through extensive evaluation on public and proprietary safety benchmarks, SGuard-v1 achieves state-of-the-art safety performance while remaining lightweight, thereby reducing deployment overhead. SGuard-v1 also improves interpretability for downstream use by providing multi-class safety predictions and their binary confidence scores. We release the SGuard-v1 under the Apache-2.0 License to enable further research and practical deployment in AI safety.

Key Contributions

Dual-component guardrail system (ContentFilter for input/output moderation + JailbreakFilter for adversarial jailbreak detection) built on a lightweight 2B-parameter base model
Bilingual (English/Korean) training pipeline with ~1.4M curated and synthesized instances, covering 60 major jailbreak attack types with curriculum learning to reduce false positives
Open-source release under Apache-2.0 with multi-class safety predictions and binary confidence scores for interpretability

🛡️ Threat Analysis

Details

Domains

nlp

Model Types

llmtransformer

Threat Tags

inference_timeblack_box

Datasets

MLCommons AILuminateproprietary safety benchmarks

Applications

llm content moderationjailbreak detectionconversational ai safety

Read PDF arXiv DOI Code

SGuard-v1: Safety Guardrail for Large Language Models

Key Contributions

🛡️ Threat Analysis

Details

Similar Papers

Proactive Hardening of LLM Defenses with HASTE

NeuroBreak: Unveil Internal Jailbreak Mechanisms in Large Language Models

MindGuard: Guardrail Classifiers for Multi-Turn Mental Health Support

Taxonomy-Adaptive Moderation Model with Robust Guardrails for Large Language Models

Detecting Jailbreak Attempts in Clinical Training LLMs Through Automated Linguistic Feature Extraction

Quantifying Document Impact in RAG-LLMs

Learning-Based Automated Adversarial Red-Teaming for Robustness Evaluation of Large Language Models

How Toxic Can You Get? Search-based Toxicity Testing for Large Language Models