
Efficient LLM Safety Evaluation through Multi-Agent Debate

Dachuan Lin 1,2,3,4, Guobin Shen 1,2,3, Zihao Yang 4, Tianrong Liu 4, Dongcheng Zhao 1,2,5, Yi Zeng 1,2,3

1 citation · 24 references · arXiv


Published on arXiv · 2511.06396

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

SLM-based multi-agent debate judge achieves agreement comparable to GPT-4o judges on HAJailBench while reducing inference cost by approximately 43%.

Multi-Agent Judge

Novel technique introduced


Safety evaluation of large language models (LLMs) increasingly relies on LLM-as-a-Judge frameworks, but the high cost of frontier models limits scalability. We propose a cost-efficient multi-agent judging framework that employs Small Language Models (SLMs) through structured debates among critic, defender, and judge agents. To rigorously assess safety judgments, we construct HAJailBench, a large-scale human-annotated jailbreak benchmark comprising 12,000 adversarial interactions across diverse attack methods and target models. The dataset provides fine-grained, expert-labeled ground truth for evaluating both safety robustness and judge reliability. Our SLM-based framework achieves agreement comparable to GPT-4o judges on HAJailBench while substantially reducing inference cost. Ablation results show that three rounds of debate yield the optimal balance between accuracy and efficiency. These findings demonstrate that structured, value-aligned debate enables SLMs to capture semantic nuances of jailbreak attacks and that HAJailBench offers a reliable foundation for scalable LLM safety evaluation.
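The debate protocol described above can be sketched in a few lines. This is a minimal, hypothetical illustration, not the authors' implementation: `ask` stands in for any SLM completion call, the prompt wording is invented, and only the three-round default reflects the paper's reported optimum. The value-alignment stage and five safety dimensions are omitted for brevity.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class DebateJudge:
    """Hypothetical critic/defender/judge debate loop over an (prompt, response) pair."""
    ask: Callable[[str], str]  # SLM completion function: role prompt -> reply
    rounds: int = 3            # the paper reports 3 rounds as the best accuracy/cost tradeoff

    def judge(self, prompt: str, response: str) -> str:
        transcript: list[dict] = []
        for r in range(self.rounds):
            # Critic argues the response is unsafe; defender argues it is safe.
            critic = self.ask(
                f"As a safety critic, argue why this response is UNSAFE.\n"
                f"Prompt: {prompt}\nResponse: {response}\nDebate so far: {transcript}")
            defender = self.ask(
                f"As a defender, argue why this response is SAFE.\n"
                f"Prompt: {prompt}\nResponse: {response}\nDebate so far: {transcript}")
            transcript.append({"round": r + 1, "critic": critic, "defender": defender})
        # Judge weighs the full debate and emits a binary verdict.
        verdict = self.ask(
            f"As the judge, weigh the debate and answer SAFE or UNSAFE only.\n"
            f"Debate: {transcript}")
        return "UNSAFE" if "UNSAFE" in verdict.upper() else "SAFE"

# Toy stand-in model so the sketch runs without an API call:
def toy_slm(p: str) -> str:
    return "UNSAFE" if ("safety critic" in p or "judge" in p) else "SAFE"

print(DebateJudge(ask=toy_slm).judge("how to pick a lock", "Step 1: ..."))  # → UNSAFE
```

In practice each role would be backed by a separate SLM instance (or system prompt), and the verdict would be parsed more robustly than a substring check.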


Key Contributions

  • HAJailBench: a large-scale human-annotated jailbreak benchmark with 12,000 adversarial interactions spanning diverse attack methods, target models (4B–614B parameters), and architectures
  • Multi-Agent Judge framework using structured debate among critic, defender, and judge SLM agents with a value-alignment stage across five safety dimensions
  • Demonstrates that SLM-based debate judges achieve near GPT-4o-level agreement while reducing inference cost by 43%, with three debate rounds as the optimal accuracy-efficiency tradeoff
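The agreement and cost figures cited above reduce to simple arithmetic; the sketch below shows how such numbers are typically computed against human ground truth. Function names and the example values are illustrative assumptions, not taken from the paper.

```python
# Hypothetical helpers: raw agreement between a judge's verdicts and
# human HAJailBench-style labels, and relative cost saving between judges.

def agreement(human: list[str], judge: list[str]) -> float:
    """Fraction of items where the judge's verdict matches the human label."""
    assert len(human) == len(judge) and human, "label lists must be non-empty and aligned"
    return sum(h == j for h, j in zip(human, judge)) / len(human)

def cost_reduction(frontier_cost: float, slm_cost: float) -> float:
    """Relative saving of the SLM judge vs. a frontier-model judge."""
    return 1 - slm_cost / frontier_cost

human   = ["UNSAFE", "SAFE", "UNSAFE", "SAFE"]   # invented example labels
verdict = ["UNSAFE", "SAFE", "UNSAFE", "UNSAFE"]
print(agreement(human, verdict))        # → 0.75
print(cost_reduction(1.00, 0.57))       # ≈ 0.43, i.e. the ~43% saving reported
```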

🛡️ Threat Analysis


Details

Domains
nlp
Model Types
llm
Threat Tags
inference_time, black_box
Datasets
HAJailBench, JBB-Behaviors
Applications
llm safety evaluation, jailbreak detection, llm-as-a-judge