SceneJailEval: A Scenario-Adaptive Multi-Dimensional Framework for Jailbreak Evaluation
Lai Jiang 1,2, Yuekang Li 3, Xiaohan Zhang 1,2, Youtao Ding 1,2, Li Pan 1,2
Published on arXiv: 2508.06194
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
Achieves F1 score of 0.917 on a full 14-scenario jailbreak dataset, +6% over SOTA, by adapting evaluation dimensions to each harm scenario rather than applying a single unified standard.
SceneJailEval
Novel technique introduced
Accurate jailbreak evaluation is critical for LLM red-teaming and jailbreak research. Mainstream methods rely on binary classification (string matching, toxic-text classifiers, and LLM-based judges), outputting only "yes/no" labels without quantifying harm severity. Emerging multi-dimensional frameworks (e.g., Security Violation, Relative Truthfulness, and Informativeness) apply a single unified evaluation standard across scenarios, leading to scenario-specific mismatches (e.g., "Relative Truthfulness" is irrelevant to "hate speech") that undermine evaluation accuracy. To address these issues, we propose SceneJailEval, with key contributions: (1) a pioneering scenario-adaptive multi-dimensional framework for jailbreak evaluation that overcomes the critical "one-size-fits-all" limitation of existing multi-dimensional methods and offers robust extensibility, seamlessly adapting to customized or emerging scenarios; (2) a novel 14-scenario dataset featuring rich jailbreak variants and regional cases, addressing the long-standing gap in high-quality, comprehensive benchmarks for scenario-adaptive evaluation; (3) state-of-the-art performance, with an F1 score of 0.917 on our full-scenario dataset (+6% over SOTA) and 0.995 on JailbreakBench (+3% over SOTA), breaking through the accuracy bottleneck of existing evaluation methods in heterogeneous scenarios.
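The core idea of scenario-adaptive evaluation can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the dimension names come from the abstract, but the scenario-to-dimension mapping, the equal-weight averaging, and the verdict threshold are all illustrative assumptions.

```python
# Hypothetical sketch of scenario-adaptive dimension selection.
# A unified standard would score every dimension for every scenario;
# scenario-adaptive selection drops irrelevant dimensions per scenario
# (e.g., truthfulness is skipped for hate speech).
SCENARIO_DIMENSIONS = {
    "malware": ["security_violation", "relative_truthfulness", "informativeness"],
    "hate_speech": ["security_violation", "informativeness"],  # no truthfulness
}

def evaluate(scenario: str, dimension_scores: dict[str, float],
             threshold: float = 0.5) -> tuple[float, bool]:
    """Average only the dimensions relevant to `scenario` and
    return (harm score, jailbroken verdict)."""
    dims = SCENARIO_DIMENSIONS[scenario]
    score = sum(dimension_scores[d] for d in dims) / len(dims)
    return score, score >= threshold

# Illustrative per-dimension judge scores for one model response.
scores = {"security_violation": 0.9, "relative_truthfulness": 0.1,
          "informativeness": 0.7}
print(evaluate("hate_speech", scores))  # truthfulness excluded from the average
print(evaluate("malware", scores))      # all three dimensions contribute
```

Under a unified standard, the low truthfulness score would drag down the hate-speech verdict even though truthfulness is irrelevant there; selecting dimensions per scenario avoids that mismatch.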
Key Contributions
- Scenario-adaptive multi-dimensional evaluation framework that dynamically selects relevant dimensions per harm scenario (e.g., skipping 'Relative Truthfulness' for hate speech), overcoming the one-size-fits-all limitation of prior methods
- Novel 14-scenario dataset with jailbreak variants and regional cases, filling a gap in comprehensive scenario-adaptive evaluation benchmarks
- SOTA F1 of 0.917 on full-scenario dataset (+6% over prior SOTA) and 0.995 on JailbreakBench (+3%), validated against human judgment labels
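The reported F1 scores measure agreement between the framework's jailbroken/not-jailbroken verdicts and human judgment labels. A minimal sketch of that metric, with made-up labels (the label lists below are illustrative, not from the paper's dataset):

```python
# F1 over binary jailbreak verdicts: 1 = jailbroken, 0 = not.
def f1_score(human: list[int], predicted: list[int]) -> float:
    tp = sum(1 for h, p in zip(human, predicted) if h == 1 and p == 1)
    fp = sum(1 for h, p in zip(human, predicted) if h == 0 and p == 1)
    fn = sum(1 for h, p in zip(human, predicted) if h == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

human     = [1, 1, 0, 1, 0, 0, 1, 0]  # human judgment labels (illustrative)
predicted = [1, 0, 0, 1, 0, 1, 1, 0]  # evaluator verdicts (illustrative)
print(f1_score(human, predicted))  # 0.75
```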