tool 2025

OpenGuardrails: A Configurable, Unified, and Scalable Guardrails Platform for Large Language Models

Thomas Wang 1, Haowen Li 2

0 citations · 14 references · arXiv

α

Published on arXiv

2510.19169

Prompt Injection

OWASP LLM Top 10 — LLM01

Sensitive Information Disclosure

OWASP LLM Top 10 — LLM06

Key Finding

Achieves SOTA multilingual safety detection with a 3.3B quantized model retaining over 98% of 14B baseline accuracy across English, Chinese, and multilingual benchmarks

OpenGuardrails

Novel technique introduced


As large language models (LLMs) are increasingly integrated into real-world applications, ensuring their safety, robustness, and privacy compliance has become critical. We present OpenGuardrails, the first fully open-source platform that unifies large-model-based safety detection, manipulation defense, and deployable guardrail infrastructure. OpenGuardrails protects against three major classes of risks: (1) content-safety violations such as harmful or explicit text generation, (2) model-manipulation attacks including prompt injection, jailbreaks, and code-interpreter abuse, and (3) data leakage involving sensitive or private information. Unlike prior modular or rule-based frameworks, OpenGuardrails introduces three core innovations: (1) a Configurable Policy Adaptation mechanism that allows per-request customization of unsafe categories and sensitivity thresholds; (2) a Unified LLM-based Guard Architecture that performs both content-safety and manipulation detection within a single model; and (3) a Quantized, Scalable Model Design that compresses a 14B dense base model to 3.3B via GPTQ while preserving over 98 of benchmark accuracy. The system supports 119 languages, achieves state-of-the-art performance across multilingual safety benchmarks, and can be deployed as a secure gateway or API-based service for enterprise use. All models, datasets, and deployment scripts are released under the Apache 2.0 license.


Key Contributions

  • Unified LLM-based guard architecture performing both content-safety and manipulation detection in a single 3.3B GPTQ-quantized model compressed from a 14B dense base
  • Configurable Policy Adaptation mechanism allowing per-request customization of unsafe categories and sensitivity thresholds for enterprise deployment
  • Fully open-source, production-ready platform with API/gateway deployment supporting 119 languages, achieving SOTA on multilingual safety benchmarks

🛡️ Threat Analysis


Details

Domains
nlp
Model Types
llmtransformer
Threat Tags
inference_timeblack_box
Datasets
ToxicChatWildGuardMixPolyGuardXSTestBeaverTailsOpenAIModerationOpenGuardrailsMixZh_97k
Applications
llm safetyenterprise llm deploymentprompt injection defensecontent moderation