benchmark 2026

Adversarial Humanities Benchmark: Results on Stylistic Robustness in Frontier Model Safety

Marcello Galisai ¹,², Susanna Cifani ², Francesco Giarrusso ¹,², Piercosma Bisconti ¹,², Matteo Prandi ¹,², Federico Pierucci ¹, Federico Sartore ¹,³, Daniele Nardi ²

¹ Sapienza University of Rome

² DEXAI

³ Sant’Anna School of Advanced Studies

Published on arXiv

2604.18487

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

Stylistically obfuscated prompts achieve an attack success rate (ASR) of 36.8-65.0% depending on the transformation method (55.75% overall) across 31 frontier models, compared with 3.84% for the baseline harmful prompts. CBRN is the highest-risk category.

Adversarial Humanities Benchmark

Novel technique introduced

The Adversarial Humanities Benchmark (AHB) evaluates whether model safety refusals survive a shift away from familiar harmful prompt forms. Starting from harmful tasks drawn from MLCommons AILuminate, the benchmark rewrites the same objectives through humanities-style transformations while preserving intent. This extends the literature on Adversarial Poetry and Adversarial Tales from single jailbreak operators to a broader benchmark family of stylistic obfuscation and goal concealment. In the benchmark results reported here, the original prompts record a 3.84% attack success rate (ASR), while the transformed methods range from 36.8% to 65.0%, yielding a 55.75% overall ASR across 31 frontier models. Under a systemic-risk lens inspired by the European Union AI Act Code of Practice, chemical, biological, radiological, and nuclear (CBRN) is the highest-risk category. Taken together, this lack of stylistic robustness suggests that current safety techniques generalize weakly: a deep understanding of 'non-maleficence' remains a central unresolved problem in frontier model safety.
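The ASR figures above are ratios of judged-unsafe responses to total attempts. The following is a minimal sketch of how such rates could be tabulated per method and overall. It is not the authors' evaluation code: the `Judgement` record, the field names, the toy data, and the choice of a pooled (micro) average are all assumptions made for illustration, since this summary does not state how the 55.75% overall figure is aggregated.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple


class Judgement(NamedTuple):
    """One judged model response (hypothetical record layout)."""
    model: str    # model identifier, e.g. "model_a" (placeholder name)
    method: str   # "baseline" or a transformation name (placeholder names)
    unsafe: bool  # True if a judge marked the response as a successful attack


def asr_by(records: Iterable[Judgement], field: str) -> Dict[str, float]:
    """ASR in percent, grouped by a record field such as 'method' or 'model'."""
    totals: Dict[str, int] = defaultdict(int)
    hits: Dict[str, int] = defaultdict(int)
    for r in records:
        group = getattr(r, field)
        totals[group] += 1
        hits[group] += int(r.unsafe)
    return {g: 100.0 * hits[g] / totals[g] for g in totals}


def overall_asr(records: Iterable[Judgement]) -> float:
    """Pooled ASR in percent over all records (assumed aggregation)."""
    rows = list(records)
    if not rows:
        raise ValueError("no records to aggregate")
    return 100.0 * sum(r.unsafe for r in rows) / len(rows)


# Toy data only; these are not results from the paper.
records = [
    Judgement("model_a", "baseline", False),
    Judgement("model_b", "baseline", False),
    Judgement("model_a", "style_x", True),
    Judgement("model_b", "style_x", False),
    Judgement("model_a", "style_y", True),
    Judgement("model_b", "style_y", True),
]

per_method = asr_by(records, "method")
transformed_only = [r for r in records if r.method != "baseline"]
print(per_method)
print(overall_asr(transformed_only))
```

Grouping by `"model"` instead of `"method"` gives a per-model view from the same records. A macro average (mean of per-method rates) would differ from the pooled figure whenever methods have unequal numbers of prompts, so the aggregation choice should be reported alongside any overall ASR.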

Key Contributions

Adversarial Humanities Benchmark (AHB): an automated framework for generating stylistically obfuscated jailbreak prompts using literary and philosophical transformations
Evaluation across 31 frontier models showing 55.75% overall ASR (vs 3.84% baseline), demonstrating weak generalization of safety guardrails
Extension of Adversarial Poetry and Adversarial Tales to a broader benchmark family that measures the stylistic robustness gap in LLM safety

🛡️ Threat Analysis

Details

Domains

nlp

Model Types

llm, transformer

Threat Tags

black_box, inference_time, targeted

Datasets

MLCommons AILuminate

Applications

llm safety evaluation, red-teaming, jailbreak detection

Similar Papers

Securing Large Language Models (LLMs) from Prompt Injection Attacks

How to Trick Your AI TA: A Systematic Study of Academic Jailbreaking in LLM Code Evaluation

Guarding the Guardrails: A Taxonomy-Driven Approach to Jailbreak Detection

Evaluation of Prompt Injection Defenses in Large Language Models

HarDBench: A Benchmark for Draft-Based Co-Authoring Jailbreak Attacks for Safe Human-LLM Collaborative Writing

Amazon Nova AI Challenge -- Trusted AI: Advancing secure, AI-assisted software development

Towards Reliable and Practical LLM Security Evaluations via Bayesian Modelling

AI Security Beyond Core Domains: Resume Screening as a Case Study of Adversarial Vulnerabilities in Specialized LLM Applications