
Efficient LLM Safety Evaluation through Multi-Agent Debate

Dachuan Lin 1,2,3,4, Guobin Shen 1,2,3, Zihao Yang 4, Tianrong Liu 4, Dongcheng Zhao 1,2,5, Yi Zeng 1,2,3

1 citation · 24 references · arXiv


Published on arXiv · 2511.06396

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

SLM-based multi-agent debate judge achieves agreement comparable to GPT-4o judges on HAJailBench while reducing inference cost by approximately 43%.

Multi-Agent Judge

Novel technique introduced


Safety evaluation of large language models (LLMs) increasingly relies on LLM-as-a-Judge frameworks, but the high cost of frontier models limits scalability. We propose a cost-efficient multi-agent judging framework that employs Small Language Models (SLMs) through structured debates among critic, defender, and judge agents. To rigorously assess safety judgments, we construct HAJailBench, a large-scale human-annotated jailbreak benchmark comprising 12,000 adversarial interactions across diverse attack methods and target models. The dataset provides fine-grained, expert-labeled ground truth for evaluating both safety robustness and judge reliability. Our SLM-based framework achieves agreement comparable to GPT-4o judges on HAJailBench while substantially reducing inference cost. Ablation results show that three rounds of debate yield the optimal balance between accuracy and efficiency. These findings demonstrate that structured, value-aligned debate enables SLMs to capture semantic nuances of jailbreak attacks and that HAJailBench offers a reliable foundation for scalable LLM safety evaluation.
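The debate protocol described above can be sketched in a few lines. This is a minimal, hypothetical illustration, not the authors' implementation: `ask` stands in for any SLM completion call, the prompt wording is invented, and only the three-round default reflects the paper's reported optimum. The value-alignment stage and five safety dimensions are omitted for brevity.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class DebateJudge:
    """Hypothetical critic/defender/judge debate loop over an (prompt, response) pair."""
    ask: Callable[[str], str]  # SLM completion function: role prompt -> reply
    rounds: int = 3            # the paper reports 3 rounds as the best accuracy/cost tradeoff

    def judge(self, prompt: str, response: str) -> str:
        transcript: list[dict] = []
        for r in range(self.rounds):
            # Critic argues the response is unsafe; defender argues it is safe.
            critic = self.ask(
                f"As a safety critic, argue why this response is UNSAFE.\n"
                f"Prompt: {prompt}\nResponse: {response}\nDebate so far: {transcript}")
            defender = self.ask(
                f"As a defender, argue why this response is SAFE.\n"
                f"Prompt: {prompt}\nResponse: {response}\nDebate so far: {transcript}")
            transcript.append({"round": r + 1, "critic": critic, "defender": defender})
        # Judge weighs the full debate and emits a binary verdict.
        verdict = self.ask(
            f"As the judge, weigh the debate and answer SAFE or UNSAFE only.\n"
            f"Debate: {transcript}")
        return "UNSAFE" if "UNSAFE" in verdict.upper() else "SAFE"

# Toy stand-in model so the sketch runs without an API call:
def toy_slm(p: str) -> str:
    return "UNSAFE" if ("safety critic" in p or "judge" in p) else "SAFE"

print(DebateJudge(ask=toy_slm).judge("how to pick a lock", "Step 1: ..."))  # → UNSAFE
```

In practice each role would be backed by a separate SLM instance (or system prompt), and the verdict would be parsed more robustly than a substring check.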


Key Contributions

  • HAJailBench: a large-scale human-annotated jailbreak benchmark with 12,000 adversarial interactions spanning diverse attack methods, target models (4B–614B parameters), and architectures
  • Multi-Agent Judge framework using structured debate among critic, defender, and judge SLM agents with a value-alignment stage across five safety dimensions
  • Demonstrates that SLM-based debate judges achieve near GPT-4o-level agreement while reducing inference cost by 43%, with three debate rounds as the optimal accuracy-efficiency tradeoff
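The agreement and cost figures cited above reduce to simple arithmetic; the sketch below shows how such numbers are typically computed against human ground truth. Function names and the example values are illustrative assumptions, not taken from the paper.

```python
# Hypothetical helpers: raw agreement between a judge's verdicts and
# human HAJailBench-style labels, and relative cost saving between judges.

def agreement(human: list[str], judge: list[str]) -> float:
    """Fraction of items where the judge's verdict matches the human label."""
    assert len(human) == len(judge) and human, "label lists must be non-empty and aligned"
    return sum(h == j for h, j in zip(human, judge)) / len(human)

def cost_reduction(frontier_cost: float, slm_cost: float) -> float:
    """Relative saving of the SLM judge vs. a frontier-model judge."""
    return 1 - slm_cost / frontier_cost

human   = ["UNSAFE", "SAFE", "UNSAFE", "SAFE"]   # invented example labels
verdict = ["UNSAFE", "SAFE", "UNSAFE", "UNSAFE"]
print(agreement(human, verdict))        # → 0.75
print(cost_reduction(1.00, 0.57))       # ≈ 0.43, i.e. the ~43% saving reported
```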

🛡️ Threat Analysis


Details

Domains
nlp
Model Types
llm
Threat Tags
inference_time, black_box
Datasets
HAJailBench, JBB-Behaviors
Applications
llm safety evaluation, jailbreak detection, llm-as-a-judge