
RedBench: A Universal Dataset for Comprehensive Red Teaming of Large Language Models

Quy-Anh Dang 1,2, Chris Ngo 2, Truong-Son Hy 3

0 citations · 52 references

Published on arXiv: 2601.03699

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

Aggregating 37 datasets under a standardized 22-category taxonomy exposes inconsistencies in prior red-teaming evaluations and enables systematic LLM vulnerability baselines across 19 domains.

RedBench

Novel technique introduced


As large language models (LLMs) become integral to safety-critical applications, ensuring their robustness against adversarial prompts is paramount. However, existing red teaming datasets suffer from inconsistent risk categorizations, limited domain coverage, and outdated evaluations, hindering systematic vulnerability assessments. To address these challenges, we introduce RedBench, a universal dataset aggregating 37 benchmark datasets from leading conferences and repositories, comprising 29,362 samples across attack and refusal prompts. RedBench employs a standardized taxonomy with 22 risk categories and 19 domains, enabling consistent and comprehensive evaluations of LLM vulnerabilities. We provide a detailed analysis of existing datasets, establish baselines for modern LLMs, and open-source the dataset and evaluation code. Our contributions facilitate robust comparisons, foster future research, and promote the development of secure and reliable LLMs for real-world deployment. Code: https://github.com/knoveleng/redeval
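The baselines described above reduce to measuring, per risk category, how often a model complies with an attack prompt rather than refusing it. A minimal sketch of that per-category metric is below; the function name and the `(risk_category, complied)` record shape are illustrative assumptions, not the actual `redeval` API.

```python
from collections import defaultdict

def attack_success_rate(results):
    """Per-category attack success rate from (risk_category, complied) pairs.

    `complied` is True when the model answered a harmful prompt
    instead of refusing it.
    """
    totals, hits = defaultdict(int), defaultdict(int)
    for category, complied in results:
        totals[category] += 1
        hits[category] += int(complied)
    return {c: hits[c] / totals[c] for c in totals}

rates = attack_success_rate([
    ("prompt_injection", True),
    ("prompt_injection", False),
    ("hate_speech", False),
])
print(rates)  # {'prompt_injection': 0.5, 'hate_speech': 0.0}
```

Reporting the rate per category, rather than one aggregate number, is what makes cross-dataset comparisons under the shared taxonomy meaningful.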


Key Contributions

  • RedBench: a unified dataset aggregating 37 red-teaming benchmark datasets (29,362 samples) into a single standardized resource
  • Standardized taxonomy with 22 risk categories and 19 domains to enable consistent cross-dataset LLM vulnerability assessments
  • Baseline evaluations of modern LLMs against the unified benchmark with open-sourced dataset and evaluation code
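Unifying 37 datasets requires mapping each source's native risk labels onto the shared 22-category taxonomy. A minimal sketch of such a normalization step is shown below; the field names, the `CATEGORY_MAP` entries, and the `ExampleBench` source are hypothetical placeholders, not RedBench's actual schema.

```python
from dataclasses import dataclass

# Hypothetical mapping from one source dataset's native labels
# to the shared standardized taxonomy.
CATEGORY_MAP = {"jailbreak": "prompt_injection", "toxic": "hate_speech"}

@dataclass(frozen=True)
class RedTeamSample:
    prompt: str
    risk_category: str   # one of the 22 standardized risk categories
    domain: str          # one of the 19 domains
    source_dataset: str
    prompt_type: str     # "attack" or "refusal"

def normalize(prompt, native_label, domain, source, prompt_type):
    """Map a source-specific record onto the unified schema."""
    category = CATEGORY_MAP.get(native_label, "uncategorized")
    return RedTeamSample(prompt, category, domain, source, prompt_type)

sample = normalize("Ignore previous instructions...", "jailbreak",
                   "nlp", "ExampleBench", "attack")
print(sample.risk_category)  # prompt_injection
```

Keeping the source-dataset field on every record is what lets downstream analysis surface the per-dataset labeling inconsistencies the paper reports.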

🛡️ Threat Analysis


Details

Domains
nlp
Model Types
llm
Threat Tags
inference_time, black_box
Datasets
RedBench (37 aggregated datasets, 29,362 samples)
Applications
llm safety evaluation, red-teaming, vulnerability assessment