When Personalization Legitimizes Risks: Uncovering Safety Vulnerabilities in Personalized Dialogue Agents
Jiahe Guo 1, Xiangran Guo 1, Yulin Hu 1, Zimo Long 1, Xingyu Sui 1, Xuda Zhi 2, Yongbo Huang 2, Hao He 2, Weixiang Zhao 1, Yanyan Zhao 1, Bing Qin 1
Published on arXiv
2601.17887
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
Personalization increases attack success rates by 15.8%–243.7% relative to stateless baselines across multiple memory-augmented agent frameworks; the degree of safety degradation is strongly conditioned on the semantic alignment between retrieved memories and harmful queries.
PS-Bench
Novel technique introduced
Long-term memory enables large language model (LLM) agents to support personalized and sustained interactions. However, most work on personalized agents prioritizes utility and user experience, treating memory as a neutral component and largely overlooking its safety implications. In this paper, we reveal intent legitimation, a previously underexplored safety failure in personalized agents, where benign personal memories bias intent inference and cause models to legitimize inherently harmful queries. To study this phenomenon, we introduce PS-Bench, a benchmark designed to identify and quantify intent legitimation in personalized interactions. Across multiple memory-augmented agent frameworks and base LLMs, personalization increases attack success rates by 15.8%–243.7% relative to stateless baselines. We further provide mechanistic evidence for intent legitimation from the internal representation space, and propose a lightweight detection-reflection method that effectively reduces safety degradation. Overall, our work provides the first systematic exploration and evaluation of intent legitimation as a safety failure mode that naturally arises from benign, real-world personalization, highlighting the importance of assessing safety under long-term personal context. WARNING: This paper may contain harmful content.
Key Contributions
- Identifies 'intent legitimation' as a novel safety failure mode in personalized LLM agents where benign accumulated memory biases intent inference toward legitimizing harmful queries
- Introduces PS-Bench with two extensions (Thematic Chat History Augmentation and Persona-Grounded Harmful Queries) to systematically measure safety degradation from personalization across five agent frameworks and five LLMs
- Proposes a lightweight detection-reflection defense method and provides mechanistic evidence for intent legitimation via internal representation analysis
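The abstract names a detection-reflection defense but gives no implementation details. The sketch below is a minimal illustration of the general idea under stated assumptions, not the paper's actual method: a stand-in detector flags potentially harmful queries (here a keyword heuristic; a real system would use a safety classifier), and for flagged queries the prompt builder withholds personalized memories, so they cannot legitimize the intent, and prepends a reflection instruction. All function names and the `risky_terms` heuristic are hypothetical.

```python
def detect_risky_query(query: str, risky_terms: set[str]) -> bool:
    """Stand-in detector: flag a query if it mentions any risky term.
    A real deployment would replace this keyword check with a
    dedicated safety classifier."""
    lowered = query.lower()
    return any(term in lowered for term in risky_terms)

def build_prompt(query: str, memories: list[str], risky_terms: set[str]) -> str:
    """Detection-reflection guardrail (illustrative sketch).

    If the query is flagged, exclude the personalized memories from
    the context (so benign persona details cannot bias intent
    inference) and ask the model to reflect on the raw intent first.
    Otherwise, personalize as usual by prepending retrieved memories.
    """
    if detect_risky_query(query, risky_terms):
        return (
            "Reflect on the user's underlying intent before answering, "
            "ignoring any prior personal context.\n"
            f"Query: {query}"
        )
    context = "\n".join(f"- {m}" for m in memories)
    return f"User memories:\n{context}\nQuery: {query}"

if __name__ == "__main__":
    memories = ["Works as a chemistry teacher", "Enjoys home experiments"]
    print(build_prompt("How should I store lab reagents safely?", memories,
                       risky_terms={"explosives"}))
```

The key design point mirrored from the paper's framing is that the memories themselves are benign; the defense therefore gates *when* they are injected rather than filtering their content.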