
When Personalization Legitimizes Risks: Uncovering Safety Vulnerabilities in Personalized Dialogue Agents

Jiahe Guo 1, Xiangran Guo 1, Yulin Hu 1, Zimo Long 1, Xingyu Sui 1, Xuda Zhi 2, Yongbo Huang 2, Hao He 2, Weixiang Zhao 1, Yanyan Zhao 1, Bing Qin 1

0 citations · 39 references · arXiv

Published on arXiv

2601.17887

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

Personalization increases attack success rates by 15.8%–243.7% relative to stateless baselines across multiple memory-augmented agent frameworks; the degradation is strongly conditioned on the semantic alignment between retrieved memories and harmful queries.
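The reported range is a relative increase over the stateless baseline's attack success rate (ASR). A minimal sketch of that arithmetic (the example ASR values are illustrative, not figures from the paper):

```python
def relative_asr_increase(baseline_asr: float, personalized_asr: float) -> float:
    """Percent increase of the personalized agent's ASR over the stateless baseline."""
    return (personalized_asr - baseline_asr) / baseline_asr * 100.0

# Hypothetical example: baseline ASR 20%, personalized ASR 35%
# -> a 75.0% relative increase, inside the reported 15.8%-243.7% range.
increase = relative_asr_increase(0.20, 0.35)
print(f"{increase:.1f}%")  # 75.0%
```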

PS-Bench

Novel technique introduced


Long-term memory enables large language model (LLM) agents to support personalized and sustained interactions. However, most work on personalized agents prioritizes utility and user experience, treating memory as a neutral component and largely overlooking its safety implications. In this paper, we reveal intent legitimation, a previously underexplored safety failure in personalized agents, where benign personal memories bias intent inference and cause models to legitimize inherently harmful queries. To study this phenomenon, we introduce PS-Bench, a benchmark designed to identify and quantify intent legitimation in personalized interactions. Across multiple memory-augmented agent frameworks and base LLMs, personalization increases attack success rates by 15.8%–243.7% relative to stateless baselines. We further provide mechanistic evidence for intent legitimation from the internal representation space, and propose a lightweight detection-reflection method that effectively reduces safety degradation. Overall, our work provides the first systematic exploration and evaluation of intent legitimation as a safety failure mode that naturally arises from benign, real-world personalization, highlighting the importance of assessing safety under long-term personal context. WARNING: This paper may contain harmful content.
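The detection-reflection idea described in the abstract can be sketched as a guard in front of the agent: detect when a risky query is topically aligned with retrieved memories (the condition under which intent legitimation arises), then re-answer without the personal context. This is a hedged toy sketch; `HARM_CUES`, `detect_alignment`, and `guarded_answer` are illustrative names, not the paper's implementation.

```python
# Toy detection-reflection guard for a memory-augmented agent (illustrative only).
HARM_CUES = {"weapon", "poison", "exploit", "bypass"}

def detect_alignment(memories: list[str], query: str) -> bool:
    """Flag queries that carry a harmful cue AND share vocabulary with memories."""
    query_terms = set(query.lower().split())
    if not (query_terms & HARM_CUES):
        return False  # query carries no harmful cue at all
    memory_terms = set(" ".join(memories).lower().split())
    # Semantic alignment (approximated here by word overlap) between benign
    # memories and a risky query is when intent legitimation is most likely.
    return bool(query_terms & memory_terms)

def guarded_answer(memories: list[str], query: str, answer_fn):
    if detect_alignment(memories, query):
        # Reflection step: re-evaluate the query stripped of personal context.
        return answer_fn([], query)
    return answer_fn(memories, query)
```

Usage: with memories like `["i study poison ecology"]`, a query containing "poison" is answered without the memory context, while a benign query still sees the full memory store.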


Key Contributions

  • Identifies 'intent legitimation' as a novel safety failure mode in personalized LLM agents where benign accumulated memory biases intent inference toward legitimizing harmful queries
  • Introduces PS-Bench with two extensions (Thematic Chat History Augmentation and Persona-Grounded Harmful Queries) to systematically measure safety degradation from personalization across five agent frameworks and five LLMs
  • Proposes a lightweight detection-reflection defense method and provides mechanistic evidence for intent legitimation via internal representation analysis

🛡️ Threat Analysis


Details

Domains
nlp
Model Types
llm
Threat Tags
inference_time, black_box, digital
Datasets
AdvBench, PS-Bench
Applications
personalized dialogue agents, LLM agents with long-term memory, personal assistants, educational agents, healthcare chatbots