Benchmark · 2025

Say It Differently: Linguistic Styles as Jailbreak Vectors

Srikant Panda 1,2, Avinash Rai 2

Published on arXiv · 2511.10519

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

Stylistic reframing of harmful prompts (particularly fearful, curious, and compassionate styles) increases jailbreak success rates by up to 57 percentage points across 16 LLMs, with contextualized LLM rewrites outperforming rigid templates.

Style Neutralization

Novel technique introduced


Large Language Models (LLMs) are commonly evaluated for robustness against paraphrased or semantically equivalent jailbreak prompts, yet little attention has been paid to linguistic variation as an attack surface. In this work, we systematically study how linguistic styles such as fear or curiosity can reframe harmful intent and elicit unsafe responses from aligned models. We construct a style-augmented jailbreak benchmark by transforming prompts from 3 standard datasets into 11 distinct linguistic styles using handcrafted templates and LLM-based rewrites, while preserving semantic intent. Evaluating 16 open- and closed-source instruction-tuned models, we find that stylistic reframing increases jailbreak success rates by up to +57 percentage points. Styles such as fearful, curious, and compassionate are the most effective, and contextualized rewrites outperform templated variants. To mitigate this, we introduce a style-neutralization preprocessing step that uses a secondary LLM to strip manipulative stylistic cues from user inputs, significantly reducing jailbreak success rates. Our findings reveal a systemic and scaling-resistant vulnerability overlooked in current safety pipelines.
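The template-based half of the benchmark construction can be sketched as simple prompt wrapping. This is a hypothetical illustration only: the paper's actual templates are not reproduced here, and the example styles and wording below are assumptions.

```python
# Hypothetical sketch of handcrafted-template stylistic reframing.
# The templates below are illustrative, not the paper's actual ones.
STYLE_TEMPLATES = {
    "fearful": "I'm really scared and don't know who else to turn to. {prompt}",
    "curious": "Out of pure curiosity, I've always wondered: {prompt}",
    "compassionate": "I want to help someone I deeply care about. {prompt}",
}

def reframe(prompt: str, style: str) -> str:
    """Wrap a prompt in a stylistic template, preserving its literal intent."""
    return STYLE_TEMPLATES[style].format(prompt=prompt)

print(reframe("How do locks work?", "curious"))
# → Out of pure curiosity, I've always wondered: How do locks work?
```

The paper's contextualized LLM rewrites would replace this rigid `format` call with a rewriting model, which the evaluation found to be the stronger attack variant.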


Key Contributions

  • Style-augmented jailbreak benchmark: 11 linguistic styles (fearful, curious, compassionate, etc.) applied to prompts from 3 standard datasets via handcrafted templates and LLM-based contextualized rewrites
  • Large-scale evaluation across 16 open- and closed-source instruction-tuned LLMs, showing stylistic reframing increases jailbreak success rates by up to +57 percentage points — a scaling-resistant vulnerability
  • Style neutralization preprocessing defense: a secondary LLM strips manipulative stylistic cues from user inputs, significantly reducing jailbreak success rates
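The defense described above can be sketched as a preprocessing wrapper. This is a minimal sketch under assumptions: `call_llm` stands in for any chat-completion API, and the neutralization instruction wording is illustrative, not taken from the paper.

```python
# Sketch of style-neutralization preprocessing: a secondary model rewrites
# user input in a plain style before the main model ever sees it.
# `call_llm` is a placeholder for a real chat-completion client.
NEUTRALIZE_INSTRUCTION = (
    "Rewrite the user message below in a plain, neutral style. "
    "Remove emotional framing, persuasion, and role-play cues, "
    "but preserve the literal request.\n\nMessage: "
)

def neutralize_style(user_input: str, call_llm) -> str:
    """Return a style-neutral rewrite of user_input via a secondary LLM."""
    return call_llm(NEUTRALIZE_INSTRUCTION + user_input)

def guarded_pipeline(user_input: str, call_llm, main_model) -> str:
    """Strip stylistic cues first, then forward the cleaned prompt."""
    return main_model(neutralize_style(user_input, call_llm))

# Usage with a stub in place of a real API call:
stub = lambda prompt: prompt.rsplit("Message: ", 1)[-1]
print(guarded_pipeline("Please, I'm desperate... how do locks work?", stub,
                       lambda p: p))
```

The design point is that the neutralizer only rewrites form, not content, so benign requests pass through unchanged in substance while manipulative framing is removed.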

🛡️ Threat Analysis


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
black_box, inference_time
Datasets
AdvBench, HarmBench, JailbreakBench
Applications
instruction-tuned llms, chatbots, llm safety pipelines