Prompt Attack Detection with LLM-as-a-Judge and Mixture-of-Models

Prompt attacks, including jailbreaks and prompt injections, pose a critical security risk to Large Language Model (LLM) systems. In production, guardrails must mitigate these attacks under strict low-latency constraints, resulting in a deployment gap in which lightweight classifiers and rule-based systems struggle to generalize under distribution shift, while high-capacity LLM-based judges remain too slow or costly for live enforcement. In this work, we examine whether lightweight, general-purpose LLMs can reliably serve as security judges under real-world production constraints. Through careful prompt and output design, lightweight LLMs are guided through a structured reasoning process involving explicit intent decomposition, safety-signal verification, harm assessment, and self-reflection. We evaluate our method on a curated dataset combining benign queries from real-world chatbots with adversarial prompts generated via automated red teaming (ART), covering diverse and evolving patterns. Our results show that general-purpose LLMs, such as gemini-2.0-flash-lite-001, can serve as effective low-latency judges for live guardrails. This configuration is currently deployed in production as a centralized guardrail service for public service chatbots in Singapore. We additionally evaluate a Mixture-of-Models (MoM) setting to assess whether aggregating multiple LLM judges improves prompt-attack detection performance relative to single-model judges, with only modest gains observed.

Key Contributions

LLM-as-a-Judge approach with structured reasoning (intent decomposition, safety verification, harm assessment, self-reflection) for real-time prompt attack detection
Evaluation dataset combining benign queries from production chatbots with adversarial prompts from automated red teaming
Mixture-of-Models (MoM) evaluation showing modest gains from aggregating multiple LLM judges
Production deployment as centralized guardrail service for Singapore public service chatbots

🛡️ Threat Analysis

Details

Domains

nlp

Model Types

llm

Threat Tags

inference_timeblack_box

Datasets

Custom dataset: benign queries from Singapore public service chatbots + automated red teaming (ART) generated attacks

Applications

chatbot guardrailsprompt injection detectionjailbreak detection

2025 0 cit.

100%