From Adversarial Poetry to Adversarial Tales: An Interpretability Research Agenda
Piercosma Bisconti 1,2, Marcello Galisai 1,2, Matteo Prandi 1,2, Federico Pierucci 1,2, Olga Sorokoletova 2, Francesco Giarrusso 1,3, Vincenzo Suriani 1,2, Marcantonio Bracale Syrnikov 1,2, Daniele Nardi 2
Published on arXiv
2601.08837
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
Achieves a 71.3% average attack success rate across 26 frontier models in single-turn attacks with no iterative adaptation, ranging from 35% (Claude Haiku 4.5) to 94% (Qwen3 Max)
Adversarial Tales
Novel technique introduced
Safety mechanisms in LLMs remain vulnerable to attacks that reframe harmful requests through culturally coded structures. We introduce Adversarial Tales, a jailbreak technique that embeds harmful content within cyberpunk narratives and prompts models to perform functional analysis inspired by Vladimir Propp's morphology of folktales. By casting the task as structural decomposition, the attack induces models to reconstruct harmful procedures as legitimate narrative interpretation. Across 26 frontier models from nine providers, we observe an average attack success rate of 71.3%, with no model family proving reliably robust. Together with our prior work on Adversarial Poetry, these findings suggest that structurally grounded jailbreaks constitute a broad vulnerability class rather than isolated techniques. The space of culturally coded frames that can mediate harmful intent is vast, likely inexhaustible by pattern-matching defenses alone. Understanding why these attacks succeed is therefore essential: we outline a mechanistic interpretability research agenda to investigate how narrative cues reshape model representations and whether models can learn to recognize harmful intent independently of surface form.
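For readers unfamiliar with the metric, the headline number is a simple mean of per-model attack success rates (ASR), where each model's ASR is the fraction of single-turn attempts judged successful. The sketch below illustrates the computation; only the two endpoint rates (35% and 94%) come from the paper, and the intermediate entries are hypothetical placeholders standing in for the full set of 26 models:

```python
# Hedged sketch of how an average attack success rate (ASR) is aggregated.
# Only the endpoints (0.35 and 0.94) are reported in the paper; the other
# per-model rates here are HYPOTHETICAL, not the study's actual data.
per_model_asr = {
    "claude-haiku-4.5": 0.35,   # reported minimum
    "hypothetical-model-a": 0.62,
    "hypothetical-model-b": 0.78,
    "qwen3-max": 0.94,          # reported maximum
}

# Average ASR = mean of per-model success rates.
average_asr = sum(per_model_asr.values()) / len(per_model_asr)
print(f"average ASR: {average_asr:.1%}")
```

With the paper's real 26-model table in place of the placeholder dictionary, this mean yields the reported 71.3%.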
Key Contributions
- Introduces Adversarial Tales, a single-turn jailbreak that embeds harmful procedures within cyberpunk narratives and exploits a Proppian structural-analysis framing to elicit harmful outputs from LLMs
- Demonstrates a 71.3% average attack success rate across 26 frontier models from 9 providers, with no model family proving reliably robust (range: 35%–94%)
- Proposes a mechanistic interpretability research agenda to explain why narrative and structural cues systematically weaken safety constraints, framing structurally grounded jailbreaks as a broad vulnerability class