tool 2025

OpenGuardrails: A Configurable, Unified, and Scalable Guardrails Platform for Large Language Models

Thomas Wang ¹, Haowen Li ²

¹ OpenGuardrails.com

² The Hong Kong Polytechnic University

0 citations · 14 references · arXiv

Published on arXiv

2510.19169

Prompt Injection

OWASP LLM Top 10 — LLM01

Sensitive Information Disclosure

OWASP LLM Top 10 — LLM06

Key Finding

Achieves SOTA multilingual safety detection with a 3.3B quantized model retaining over 98% of 14B baseline accuracy across English, Chinese, and multilingual benchmarks

OpenGuardrails

Novel technique introduced

As large language models (LLMs) are increasingly integrated into real-world applications, ensuring their safety, robustness, and privacy compliance has become critical. We present OpenGuardrails, the first fully open-source platform that unifies large-model-based safety detection, manipulation defense, and deployable guardrail infrastructure. OpenGuardrails protects against three major classes of risks: (1) content-safety violations such as harmful or explicit text generation, (2) model-manipulation attacks including prompt injection, jailbreaks, and code-interpreter abuse, and (3) data leakage involving sensitive or private information. Unlike prior modular or rule-based frameworks, OpenGuardrails introduces three core innovations: (1) a Configurable Policy Adaptation mechanism that allows per-request customization of unsafe categories and sensitivity thresholds; (2) a Unified LLM-based Guard Architecture that performs both content-safety and manipulation detection within a single model; and (3) a Quantized, Scalable Model Design that compresses a 14B dense base model to 3.3B via GPTQ while preserving over 98 of benchmark accuracy. The system supports 119 languages, achieves state-of-the-art performance across multilingual safety benchmarks, and can be deployed as a secure gateway or API-based service for enterprise use. All models, datasets, and deployment scripts are released under the Apache 2.0 license.

Key Contributions

Unified LLM-based guard architecture performing both content-safety and manipulation detection in a single 3.3B GPTQ-quantized model compressed from a 14B dense base
Configurable Policy Adaptation mechanism allowing per-request customization of unsafe categories and sensitivity thresholds for enterprise deployment
Fully open-source, production-ready platform with API/gateway deployment supporting 119 languages, achieving SOTA on multilingual safety benchmarks

🛡️ Threat Analysis

Details

Domains

nlp

Model Types

llmtransformer

Threat Tags

inference_timeblack_box

Datasets

ToxicChatWildGuardMixPolyGuardXSTestBeaverTailsOpenAIModerationOpenGuardrailsMixZh_97k

Applications

llm safetyenterprise llm deploymentprompt injection defensecontent moderation

Read PDF arXiv DOI Code

OpenGuardrails: A Configurable, Unified, and Scalable Guardrails Platform for Large Language Models

Key Contributions

🛡️ Threat Analysis

Details

Similar Papers

Adaptive Backtracking for Privacy Protection in Large Language Models

SAFENLIDB: A Privacy-Preserving Safety Alignment Framework for LLM-based Natural Language Database Interfaces

NeuroFilter: Privacy Guardrails for Conversational LLM Agents

Searching for Privacy Risks in LLM Agents via Simulation

Activation-Space Anchored Access Control for Multi-Class Permission Reasoning in Large Language Models

Benchmarking LLAMA Model Security Against OWASP Top 10 For LLM Applications

DUET: Distilled LLM Unlearning from an Efficiently Contextualized Teacher

Eliciting Secret Knowledge from Language Models