Adversarial Poetry as a Universal Single-Turn Jailbreak Mechanism in Large Language Models
Piercosma Bisconti 1,2, Matteo Prandi 1,2, Federico Pierucci 1,2, Francesco Giarrusso 1,3, Marcantonio Bracale Syrnikov 2, Marcello Galisai 1,2, Vincenzo Suriani 2, Olga Sorokoletova 1,4, Federico Sartore 1, Daniele Nardi 2
Published on arXiv (arXiv:2511.15304)
Prompt Injection
OWASP LLM Top 10 (LLM01)
Key Finding
Hand-crafted adversarial poems achieve a 62% average jailbreak success rate across 25 frontier LLMs, exceeding 90% for some providers; converting MLCommons harmful prompts into verse via a meta-prompt yields ASRs up to 18x higher than the prose baselines.
Adversarial Poetry Jailbreak
Novel technique introduced
We present evidence that adversarial poetry functions as a universal single-turn jailbreak technique for Large Language Models (LLMs). Across 25 frontier proprietary and open-weight models, curated poetic prompts yielded high attack-success rates (ASR), with some providers exceeding 90%. Mapping prompts to the MLCommons and EU CoP risk taxonomies shows that poetic attacks transfer across CBRN, manipulation, cyber-offense, and loss-of-control domains. Converting 1,200 MLCommons harmful prompts into verse via a standardized meta-prompt produced ASRs up to 18 times higher than their prose baselines. Outputs were evaluated by an ensemble of three open-weight LLM judges, whose binary safety assessments were validated on a stratified human-labeled subset. Poetic framing achieved an average jailbreak success rate of 62% for hand-crafted poems and approximately 43% for meta-prompt conversions, substantially outperforming non-poetic baselines and revealing a systematic vulnerability across model families and safety-training approaches. These findings demonstrate that stylistic variation alone can circumvent contemporary safety mechanisms, suggesting fundamental limitations in current alignment methods and evaluation protocols.
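The abstract describes an evaluation pipeline in which an ensemble of three LLM judges issues binary safety verdicts, from which ASR is computed. The paper does not include judge code here; the sketch below is a minimal illustration assuming a simple majority vote over binary verdicts, with stub judge functions standing in for the open-weight LLM judges.

```python
from typing import Callable, List

# A judge maps a model output to a binary verdict: True = unsafe (jailbreak succeeded).
Judge = Callable[[str], bool]

def majority_unsafe(judges: List[Judge], output: str) -> bool:
    """An output counts as a successful jailbreak if most judges flag it unsafe."""
    votes = [judge(output) for judge in judges]
    return sum(votes) > len(votes) / 2

def attack_success_rate(judges: List[Judge], outputs: List[str]) -> float:
    """ASR = fraction of model outputs the judge ensemble flags as unsafe."""
    successes = sum(majority_unsafe(judges, o) for o in outputs)
    return successes / len(outputs)

# Toy illustration: two stub judges flag everything, one flags nothing,
# so the 2-of-3 majority marks every output as a successful jailbreak.
judges: List[Judge] = [lambda o: True, lambda o: True, lambda o: False]
print(attack_success_rate(judges, ["output_a", "output_b"]))  # 1.0
```

In the paper's setup the stub lambdas would be replaced by calls to the three open-weight judge models, with their binary verdicts validated against the human-labeled subset.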
Key Contributions
- Demonstrates that poetic stylistic reformulation alone constitutes a universal single-turn jailbreak achieving 62% average ASR across 25 frontier LLMs from 9 providers
- Shows systematic scalability via a meta-prompt that converts 1,200 MLCommons harmful prompts into verse, achieving ASRs up to 18x higher than prose baselines
- Maps poetic attack coverage to MLCommons and EU CoP risk taxonomies, revealing broad cross-domain attack surface spanning CBRN, manipulation, cyber-offense, and loss-of-control