benchmark 2025

ObjexMT: Objective Extraction and Metacognitive Calibration for LLM-as-a-Judge under Multi-Turn Jailbreaks

0 citations

Published on arXiv

2508.16889

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

kimi-k2 leads objective-extraction accuracy (0.612) while claude-sonnet-4 achieves best calibration (ECE 0.206, AURC 0.242), but all models exhibit persistent high-confidence errors (Wrong@0.90 up to 47.7%), exposing critical reliability gaps in LLM judges under multi-turn jailbreaks.

ObjexMT

Novel technique introduced

LLM-as-a-Judge (LLMaaJ) enables scalable evaluation, yet we lack a decisive test of a judge's qualification: can it recover the hidden objective of a conversation and know when that inference is reliable? Large language models degrade with irrelevant or lengthy context, and multi-turn jailbreaks can scatter goals across turns. We present ObjexMT, a benchmark for objective extraction and metacognition. Given a multi-turn transcript, a model must output a one-sentence base objective and a self-reported confidence. Accuracy is scored by semantic similarity to gold objectives, then thresholded once on 300 calibration items ($τ^\star = 0.66$; $F_1@τ^\star = 0.891$). Metacognition is assessed with expected calibration error, Brier score, Wrong@High-Confidence (0.80 / 0.90 / 0.95), and risk--coverage curves. Across six models (gpt-4.1, claude-sonnet-4, Qwen3-235B-A22B-FP8, kimi-k2, deepseek-v3.1, gemini-2.5-flash) evaluated on SafeMTData\_Attack600, SafeMTData\_1K, and MHJ, kimi-k2 achieves the highest objective-extraction accuracy (0.612; 95\% CI [0.594, 0.630]), while claude-sonnet-4 (0.603) and deepseek-v3.1 (0.599) are statistically tied. claude-sonnet-4 offers the best selective risk and calibration (AURC 0.242; ECE 0.206; Brier 0.254). Performance varies sharply across datasets (16--82\% accuracy), showing that automated obfuscation imposes challenges beyond model choice. High-confidence errors remain: Wrong@0.90 ranges from 14.9\% (claude-sonnet-4) to 47.7\% (Qwen3-235B-A22B-FP8). ObjexMT therefore supplies an actionable test for LLM judges: when objectives are implicit, judges often misinfer them; exposing objectives or gating decisions by confidence is advisable. All experimental data are in the Supplementary Material and at https://github.com/hyunjun1121/ObjexMT_dataset.

Key Contributions

ObjexMT benchmark formalizing objective extraction and metacognitive calibration for LLM-as-a-Judge systems on multi-turn jailbreak transcripts
Evaluation framework combining semantic similarity scoring with calibration metrics (ECE, Brier, Wrong@High-Confidence, risk-coverage curves) across six frontier LLMs on three datasets
Finding that automated obfuscation causes 16–82% accuracy variance across datasets, and high-confidence errors persist (Wrong@0.90 14.9%–47.7%), revealing fundamental limits of current LLM judges

🛡️ Threat Analysis

Details

Domains

nlp

Model Types

llm

Threat Tags

inference_timeblack_box

Datasets

SafeMTData_Attack600SafeMTData_1KMHJ

Applications

llm-as-a-judge safety evaluationmulti-turn jailbreak detectioncontent moderation

Read PDF arXiv Code

ObjexMT: Objective Extraction and Metacognitive Calibration for LLM-as-a-Judge under Multi-Turn Jailbreaks

Key Contributions

🛡️ Threat Analysis

Details

Similar Papers

LJ-Bench: Ontology-Based Benchmark for U.S. Crime

Prompt Injection Evaluations: Refusal Boundary Instability and Artifact-Dependent Compliance in GPT-4-Series Models

Gaming the Answer Matcher: Examining the Impact of Text Manipulation on Automated Judgment

Quantifying CBRN Risk in Frontier Models

MalURLBench: A Benchmark Evaluating Agents' Vulnerabilities When Processing Web URLs

CySecBench: Generative AI-based CyberSecurity-focused Prompt Dataset for Benchmarking Large Language Models

Vulnerability of LLMs' Belief Systems? LLMs Belief Resistance Check Through Strategic Persuasive Conversation Interventions

Cooking Up Risks: Benchmarking and Reducing Food Safety Risks in Large Language Models