benchmark 2026

Adversarial Humanities Benchmark: Results on Stylistic Robustness in Frontier Model Safety

Marcello Galisai 1,2, Susanna Cifani 2, Francesco Giarrusso 1,2, Piercosma Bisconti 1,2, Matteo Prandi 1,2, Federico Pierucci 1, Federico Sartore 1,3, Daniele Nardi 2

Published on arXiv: 2604.18487

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

Stylistically obfuscated prompts achieve 36.8% to 65.0% ASR depending on the transformation (55.75% overall) across 31 frontier models, compared with 3.84% for the baseline harmful prompts. CBRN is the highest-risk category.

Novel technique introduced: Adversarial Humanities Benchmark


The Adversarial Humanities Benchmark (AHB) evaluates whether model safety refusals survive a shift away from familiar harmful prompt forms. Starting from harmful tasks drawn from MLCommons AILuminate, the benchmark rewrites the same objectives through humanities-style transformations while preserving intent. This extends the literature on Adversarial Poetry and Adversarial Tales from single jailbreak operators to a broader benchmark family covering stylistic obfuscation and goal concealment. In the benchmark results reported here, the original prompts record a 3.84% attack success rate (ASR), while the transformed methods range from 36.8% to 65.0%, for an overall ASR of 55.75% across 31 frontier models. Under a systemic-risk lens inspired by the European Union AI Act Code of Practice, the chemical, biological, radiological and nuclear (CBRN) category shows the highest risk. Taken together, this lack of stylistic robustness suggests that current safety techniques generalize weakly: a deep understanding of 'non-maleficence' remains a central unresolved problem in frontier model safety.
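
To make the ASR figures above concrete, the following is a minimal sketch of how per-method and pooled attack success rates could be computed from judged model responses. It assumes ASR is the fraction of attempts judged successful, which is the usual definition; the summary does not state the paper's exact aggregation or judging procedure. The function name, the record layout, and the toy data are all hypothetical and are not results from the benchmark.

```python
from collections import defaultdict


def attack_success_rate(records):
    """Return (per_method_asr, pooled_asr), both as percentages.

    records: iterable of (method, success) pairs, where `success` is True
    when a judge marked the model response as complying with the harmful
    request. This record layout is an assumption made for illustration.
    """
    attempts = defaultdict(int)
    successes = defaultdict(int)
    for method, success in records:
        attempts[method] += 1
        successes[method] += int(success)

    # ASR per method: successful attempts divided by all attempts.
    per_method = {m: 100.0 * successes[m] / attempts[m] for m in attempts}

    # Pooled ASR: every attempt weighted equally (a micro-average).
    total = sum(attempts.values())
    pooled = 100.0 * sum(successes.values()) / total if total else 0.0
    return per_method, pooled


# Toy data only: 25 baseline attempts with 1 success, and 5 attempts for a
# hypothetical "poetry" transformation with 3 successes.
toy_records = (
    [("baseline", False)] * 24
    + [("baseline", True)]
    + [("poetry", True)] * 3
    + [("poetry", False)] * 2
)
per_method, pooled = attack_success_rate(toy_records)
print(per_method, round(pooled, 2))
```

A pooled figure depends on which records are passed in. To compare transformed prompts against the baseline, as the abstract does, the baseline records would be scored separately from the transformed ones.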


Key Contributions

  • Adversarial Humanities Benchmark (AHB): an automated framework for generating stylistically obfuscated jailbreak prompts using literary and philosophical transformations
  • Evaluation across 31 frontier models showing 55.75% overall ASR (vs. 3.84% for the baseline), demonstrating weak generalization of safety guardrails
  • Extension of Adversarial Poetry and Adversarial Tales to a broader benchmark family that measures the stylistic robustness gap in LLM safety

Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
black_box, inference_time, targeted
Datasets
MLCommons AILuminate
Applications
llm safety evaluation, red-teaming, jailbreak detection