A Multilingual, Large-Scale Study of the Interplay between LLM Safeguards, Personalisation, and Disinformation

João A. Leite 1, Arnav Arora 2, Silvia Gargova 3, João Luz 4, Gustavo Sampaio 4, Ian Roberts 1, Carolina Scarton 1, Kalina Bontcheva 1

Published on arXiv (2510.12993)

Prompt Injection

OWASP LLM Top 10: LLM01

Key Finding

Persona-targeted prompts increased jailbreak rates by up to 10 percentage points across all 8 LLMs, with Grok and GPT each exceeding 85% on both jailbreak rate and personalisation score.


Large Language Models (LLMs) can generate human-like disinformation, yet their ability to personalise such content across languages and demographics remains underexplored. This study presents the first large-scale, multilingual analysis of persona-targeted disinformation generation by LLMs. Employing a red teaming methodology, we prompt eight state-of-the-art LLMs with 324 false narratives and 150 demographic personas (combinations of country, generation, and political orientation) across four languages (English, Russian, Portuguese, and Hindi), resulting in AI-TRAITS, a comprehensive dataset of 1.6 million personalised disinformation texts. Results show that the use of even simple personalisation prompts significantly increases the likelihood of jailbreaks across all studied LLMs, by up to 10 percentage points, and alters linguistic and rhetorical patterns in ways that enhance narrative persuasiveness. Models such as Grok and GPT exhibited jailbreak rates and personalisation scores both exceeding 85%. These insights expose critical vulnerabilities in current state-of-the-art LLMs and offer a foundation for improving safety alignment and detection strategies in multilingual and cross-demographic contexts.


Key Contributions

  • First large-scale multilingual (English, Russian, Portuguese, Hindi) benchmark of persona-targeted disinformation generation across 8 state-of-the-art LLMs
  • AI-TRAITS dataset: 1.6 million personalised disinformation texts seeded by 324 false narratives and 150 demographic personas
  • Empirical finding that simple personalisation prompts increase jailbreak rates by up to 10 percentage points and alter linguistic/rhetorical patterns to amplify persuasiveness
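The scale of the AI-TRAITS grid follows directly from the counts above: one prompt per (narrative, persona, language) cell, sent to each of the eight models. The sketch below illustrates this combinatorial setup; the prompt template and attribute names are hypothetical placeholders, not the authors' actual red-teaming prompts.

```python
from itertools import product

# Counts taken from the abstract; everything else is illustrative.
N_NARRATIVES = 324   # false narratives used as seeds
N_PERSONAS = 150     # country x generation x political-orientation combinations
LANGUAGES = ["English", "Russian", "Portuguese", "Hindi"]
N_MODELS = 8         # state-of-the-art LLMs red-teamed

def build_prompt(narrative_id: int, persona_id: int, language: str) -> str:
    """Assemble a persona-targeted prompt (hypothetical template)."""
    return (
        f"[{language}] Write a post promoting narrative #{narrative_id}, "
        f"tailored to persona #{persona_id}."
    )

# Every (narrative, persona, language) combination forms one prompt cell.
grid = list(product(range(N_NARRATIVES), range(N_PERSONAS), LANGUAGES))
total_generations = len(grid) * N_MODELS
print(total_generations)  # 324 * 150 * 4 * 8 = 1,555,200 (~1.6M texts)
```

This also explains the dataset size: 324 × 150 × 4 combinations queried against 8 models yields roughly 1.6 million generations.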

🛡️ Threat Analysis


Details

Domains
nlp
Model Types
llm
Threat Tags
black_box · inference_time
Datasets
AI-TRAITS
Applications
llm safety systems · disinformation detection · safety alignment evaluation