OpenGuardrails: A Configurable, Unified, and Scalable Guardrails Platform for Large Language Models
Thomas Wang 1, Haowen Li 2
Published on arXiv
2510.19169
Prompt Injection
OWASP LLM Top 10 — LLM01
Sensitive Information Disclosure
OWASP LLM Top 10 — LLM06
Key Finding
Achieves SOTA multilingual safety detection with a 3.3B quantized model retaining over 98% of 14B baseline accuracy across English, Chinese, and multilingual benchmarks
OpenGuardrails
Novel technique introduced
As large language models (LLMs) are increasingly integrated into real-world applications, ensuring their safety, robustness, and privacy compliance has become critical. We present OpenGuardrails, the first fully open-source platform that unifies large-model-based safety detection, manipulation defense, and deployable guardrail infrastructure. OpenGuardrails protects against three major classes of risks: (1) content-safety violations such as harmful or explicit text generation, (2) model-manipulation attacks including prompt injection, jailbreaks, and code-interpreter abuse, and (3) data leakage involving sensitive or private information. Unlike prior modular or rule-based frameworks, OpenGuardrails introduces three core innovations: (1) a Configurable Policy Adaptation mechanism that allows per-request customization of unsafe categories and sensitivity thresholds; (2) a Unified LLM-based Guard Architecture that performs both content-safety and manipulation detection within a single model; and (3) a Quantized, Scalable Model Design that compresses a 14B dense base model to 3.3B via GPTQ while preserving over 98 of benchmark accuracy. The system supports 119 languages, achieves state-of-the-art performance across multilingual safety benchmarks, and can be deployed as a secure gateway or API-based service for enterprise use. All models, datasets, and deployment scripts are released under the Apache 2.0 license.
Key Contributions
- Unified LLM-based guard architecture performing both content-safety and manipulation detection in a single 3.3B GPTQ-quantized model compressed from a 14B dense base
- Configurable Policy Adaptation mechanism allowing per-request customization of unsafe categories and sensitivity thresholds for enterprise deployment
- Fully open-source, production-ready platform with API/gateway deployment supporting 119 languages, achieving SOTA on multilingual safety benchmarks