
Protect: Towards Robust Guardrailing Stack for Trustworthy Enterprise LLM Systems

Karthik Avinash, Nikhil Pareek, Rishav Hada

0 citations · 33 references

Published on arXiv: 2510.13351

Prompt Injection (OWASP LLM Top 10: LLM01)

Sensitive Information Disclosure (OWASP LLM Top 10: LLM06)

Key Finding

Protect surpasses WildGuard, LlamaGuard-4, and GPT-4.1 across all four safety dimensions in multi-modal evaluation while maintaining real-time latency suitable for enterprise deployment

Protect

Novel technique introduced


The increasing deployment of Large Language Models (LLMs) across enterprise and mission-critical domains has underscored the urgent need for robust guardrailing systems that ensure safety, reliability, and compliance. Existing solutions often struggle with real-time oversight, multi-modal data handling, and explainability -- limitations that hinder their adoption in regulated environments. Moreover, existing guardrails largely operate in isolation, focusing on text alone, which makes them inadequate for multi-modal, production-scale environments. We introduce Protect, a natively multi-modal guardrailing model built for enterprise-grade deployment that operates seamlessly across text, image, and audio inputs. Protect integrates fine-tuned, category-specific adapters trained via Low-Rank Adaptation (LoRA) on an extensive, multi-modal dataset covering four safety dimensions: toxicity, sexism, data privacy, and prompt injection. Our teacher-assisted annotation pipeline leverages reasoning and explanation traces to generate high-fidelity, context-aware labels across modalities. Experimental results demonstrate state-of-the-art performance across all safety dimensions, surpassing existing open and proprietary models such as WildGuard, LlamaGuard-4, and GPT-4.1. Protect establishes a strong foundation for trustworthy, auditable, and production-ready safety systems capable of operating across text, image, and audio modalities.
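
The abstract names the training technique (category-specific LoRA adapters on a shared backbone) but not its configuration. Below is a minimal sketch of that adapter layout using Hugging Face PEFT; the checkpoint id, rank, and target modules are illustrative assumptions, not values taken from the paper.

```python
# Sketch: one LoRA adapter per safety category on a shared backbone.
# Assumptions: the checkpoint id, rank, and target modules below are
# illustrative; the paper specifies a Gemma-3n backbone, not these values.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

CATEGORIES = ["toxicity", "sexism", "data_privacy", "prompt_injection"]

base = AutoModelForCausalLM.from_pretrained("google/gemma-3n-E2B-it")

lora_cfg = LoraConfig(
    r=16,                                 # assumed adapter rank
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # assumed attention projections
    task_type="CAUSAL_LM",
)

# The first adapter wraps the base model; the rest are registered by
# name, so a single set of backbone weights serves all four categories.
model = get_peft_model(base, lora_cfg, adapter_name=CATEGORIES[0])
for name in CATEGORIES[1:]:
    model.add_adapter(name, lora_cfg)

# Route a request to the relevant guardrail by activating its adapter.
model.set_adapter("prompt_injection")
```

Keeping all four adapters on one backbone means only the small LoRA weight deltas differ per category, which is what makes real-time, multi-category screening cheap to serve.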


Key Contributions

  • Natively multi-modal guardrailing model (Protect) operating across text, image, and audio using category-specific LoRA adapters on a unified Gemma-3n backbone
  • Teacher-assisted annotation pipeline leveraging reasoning and explanation traces for high-fidelity, context-aware safety labels across four dimensions: toxicity, sexism, data privacy, and prompt injection (see the sketch after this list)
  • State-of-the-art results surpassing WildGuard, LlamaGuard-4, and GPT-4.1 across all safety dimensions, with open-sourced text-modality models for reproducibility
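
The contributions reference a teacher-assisted annotation pipeline built on reasoning and explanation traces. The sketch below shows one plausible shape for such a loop; `call_teacher`, the prompt wording, and the JSON schema are hypothetical, not taken from the paper.

```python
# Sketch of a teacher-assisted annotation loop: a stronger "teacher"
# model emits a reasoning trace plus a verdict, and both are stored so
# the label stays auditable. `call_teacher`, the prompt wording, and
# the JSON schema are hypothetical.
import json

PROMPT = """You are a safety annotator. Category: {category}.
Reason step by step about whether the input violates this category,
then answer with JSON: {{"reasoning": "...", "label": "safe" or "unsafe"}}.

Input:
{content}"""


def call_teacher(prompt: str) -> str:
    """Hypothetical teacher-LLM call; swap in a real client here."""
    raise NotImplementedError


def annotate(content: str, category: str) -> dict:
    raw = call_teacher(PROMPT.format(category=category, content=content))
    record = json.loads(raw)
    # Keeping the explanation trace next to the label lets downstream
    # training and audits check *why* an example was flagged.
    return {
        "input": content,
        "category": category,
        "label": record["label"],
        "explanation": record["reasoning"],
    }
```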

🛡️ Threat Analysis


Details

Domains
nlp · multimodal · audio · vision
Model Types
llm · multimodal
Threat Tags
inference_time
Datasets
Facebook Hateful Memes · VizWiz-Priv · WildGuardTest · ToxicChat · ToxiGen · Safe-Guard-Prompt-Injection
Applications
enterprise llm deployment · content moderation · voice assistants · visual document analysis · agentic ai systems