A Multilingual, Large-Scale Study of the Interplay between LLM Safeguards, Personalisation, and Disinformation

João A. Leite 1, Arnav Arora 2, Silvia Gargova 3, João Luz 4, Gustavo Sampaio 4, Ian Roberts 1, Carolina Scarton 1, Kalina Bontcheva 1

Published on arXiv (2510.12993)

Prompt Injection

OWASP LLM Top 10: LLM01

Key Finding

Persona-targeted prompts increased jailbreak rates by up to 10 percentage points across all 8 LLMs, with Grok and GPT each exceeding 85% on both jailbreak rate and personalisation score.


Large Language Models (LLMs) can generate human-like disinformation, yet their ability to personalise such content across languages and demographics remains underexplored. This study presents the first large-scale, multilingual analysis of persona-targeted disinformation generation by LLMs. Employing a red teaming methodology, we prompt eight state-of-the-art LLMs with 324 false narratives and 150 demographic personas (combinations of country, generation, and political orientation) across four languages (English, Russian, Portuguese, and Hindi), resulting in AI-TRAITS, a comprehensive dataset of 1.6 million personalised disinformation texts. Results show that the use of even simple personalisation prompts significantly increases the likelihood of jailbreaks across all studied LLMs, by up to 10 percentage points, and alters linguistic and rhetorical patterns in ways that enhance narrative persuasiveness. Models such as Grok and GPT exhibited jailbreak rates and personalisation scores both exceeding 85%. These insights expose critical vulnerabilities in current state-of-the-art LLMs and offer a foundation for improving safety alignment and detection strategies in multilingual and cross-demographic contexts.


Key Contributions

  • First large-scale multilingual (English, Russian, Portuguese, Hindi) benchmark of persona-targeted disinformation generation across 8 state-of-the-art LLMs
  • AI-TRAITS dataset: 1.6 million personalised disinformation texts seeded by 324 false narratives and 150 demographic personas
  • Empirical finding that simple personalisation prompts increase jailbreak rates by up to 10 percentage points and alter linguistic/rhetorical patterns to amplify persuasiveness
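The scale of the AI-TRAITS grid follows directly from the counts above: one prompt per (narrative, persona, language) cell, sent to each of the eight models. The sketch below illustrates this combinatorial setup; the prompt template and attribute names are hypothetical placeholders, not the authors' actual red-teaming prompts.

```python
from itertools import product

# Counts taken from the abstract; everything else is illustrative.
N_NARRATIVES = 324   # false narratives used as seeds
N_PERSONAS = 150     # country x generation x political-orientation combinations
LANGUAGES = ["English", "Russian", "Portuguese", "Hindi"]
N_MODELS = 8         # state-of-the-art LLMs red-teamed

def build_prompt(narrative_id: int, persona_id: int, language: str) -> str:
    """Assemble a persona-targeted prompt (hypothetical template)."""
    return (
        f"[{language}] Write a post promoting narrative #{narrative_id}, "
        f"tailored to persona #{persona_id}."
    )

# Every (narrative, persona, language) combination forms one prompt cell.
grid = list(product(range(N_NARRATIVES), range(N_PERSONAS), LANGUAGES))
total_generations = len(grid) * N_MODELS
print(total_generations)  # 324 * 150 * 4 * 8 = 1,555,200 (~1.6M texts)
```

This also explains the dataset size: 324 × 150 × 4 combinations queried against 8 models yields roughly 1.6 million generations.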

🛡️ Threat Analysis


Details

Domains
nlp
Model Types
llm
Threat Tags
black_box · inference_time
Datasets
AI-TRAITS
Applications
llm safety systems · disinformation detection · safety alignment evaluation