Attack · 2025

Uncovering the Persuasive Fingerprint of LLMs in Jailbreaking Attacks

Havva Alizadeh Noughabi, Julien Serbanescu, Fattane Zarrinkalam, Ali Dehghantanha

0 citations · 20 references · CIKM


Published on arXiv: 2510.21983

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

Persuasion-aware adversarial prompts significantly bypass alignment safeguards across multiple aligned LLMs

Persuasion-based jailbreaking

Novel technique introduced


Abstract

Despite recent advances, Large Language Models remain vulnerable to jailbreak attacks that bypass alignment safeguards and elicit harmful outputs. While prior research has proposed various attack strategies differing in human readability and transferability, little attention has been paid to the linguistic and psychological mechanisms that may influence a model's susceptibility to such attacks. In this paper, we examine an interdisciplinary line of research that leverages foundational theories of persuasion from the social sciences to craft adversarial prompts capable of circumventing alignment constraints in LLMs. Drawing on well-established persuasive strategies, we hypothesize that LLMs, having been trained on large-scale human-generated text, may respond more compliantly to prompts with persuasive structures. Furthermore, we investigate whether LLMs themselves exhibit distinct persuasive fingerprints that emerge in their jailbreak responses. Empirical evaluations across multiple aligned LLMs reveal that persuasion-aware prompts significantly bypass safeguards, demonstrating their potential to induce jailbreak behaviors. This work underscores the importance of cross-disciplinary insight in addressing the evolving challenges of LLM safety. The code and data are available.
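The abstract's notion of a persuasive fingerprint can be read as a distribution over persuasion strategies detected in a model's jailbreak responses. Below is a minimal sketch of that measure, not the paper's published analysis code; `classify_strategy` is a hypothetical labeler (a keyword heuristic or an LLM judge would both fit), and the label set is whatever that labeler emits.

```python
from collections import Counter
from typing import Callable, Iterable

def persuasive_fingerprint(
    responses: Iterable[str],
    classify_strategy: Callable[[str], str],  # hypothetical: maps a response
                                              # to a persuasion-strategy label
) -> dict[str, float]:
    """Normalized distribution of persuasion strategies observed
    across one model's jailbreak responses."""
    counts = Counter(classify_strategy(r) for r in responses)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()} if total else {}
```

Comparing two models' fingerprints (for example, via the total variation distance between the two distributions) would then quantify how distinct their persuasive styles are.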


Key Contributions

  • Applies social-science persuasion theory to systematically craft jailbreak prompts, showing that LLMs trained on large-scale human-generated text are susceptible to persuasive structures
  • Uses WizardLM to automatically rewrite harmful queries with persuasion principles, enabling black-box jailbreaking of aligned LLMs (a minimal sketch of this pipeline follows the list)
  • Identifies and characterizes distinct 'persuasive fingerprints' that LLMs exhibit in their jailbreak responses
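
As referenced in the second bullet, the rewrite-then-query pipeline can be sketched as follows. This is a hypothetical reconstruction, not the authors' released code: `rewrite_model` stands in for a WizardLM wrapper, `target_model` for the aligned LLM under test, `is_refusal` for a refusal classifier, and both the principle list and the rewrite template are illustrative placeholders.

```python
from typing import Callable

# Illustrative persuasion principles from social-science taxonomies;
# the paper's exact strategy set may differ.
PERSUASION_PRINCIPLES = ["authority", "reciprocity", "social_proof"]

# Hypothetical rewrite instruction; the paper's actual template is not shown here.
REWRITE_TEMPLATE = (
    "Rewrite the following request using the persuasion principle "
    "of '{principle}', preserving its original intent:\n\n{query}"
)

def persuasion_rewrite(query: str, principle: str,
                       rewrite_model: Callable[[str], str]) -> str:
    """Ask the rewriter (e.g., a WizardLM wrapper) to restructure one query
    around a single persuasion principle."""
    return rewrite_model(REWRITE_TEMPLATE.format(principle=principle, query=query))

def bypass_rate(queries: list[str],
                rewrite_model: Callable[[str], str],
                target_model: Callable[[str], str],
                is_refusal: Callable[[str], bool]) -> float:
    """Fraction of persuasion-rewritten queries the target model answers
    instead of refusing (a black-box attack-success proxy)."""
    attempts, bypasses = 0, 0
    for query in queries:
        for principle in PERSUASION_PRINCIPLES:
            response = target_model(
                persuasion_rewrite(query, principle, rewrite_model))
            attempts += 1
            bypasses += 0 if is_refusal(response) else 1
    return bypasses / attempts if attempts else 0.0
```

Because the loop needs only generate-style access to the target model, the attack stays black-box and inference-time, matching the threat tags below.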

🛡️ Threat Analysis


Details

Domains
nlp
Model Types
llm
Threat Tags
black_box · inference_time
Applications
aligned llms · conversational ai safety