benchmark · arXiv · Jan 6, 2026
Xiangzhe Yuan, Zhenhao Zhang, Haoming Tang et al. · University of Iowa · City University of Hong Kong
Red-teams eight LLMs as conversational scam attackers and victims across 18,648 multi-turn dialogues to map safety failure modes
Prompt Injection · nlp
As LLMs gain persuasive agentic capabilities through extended dialogues, they introduce novel risks in multi-turn conversational scams that single-turn safety evaluations fail to capture. We systematically study these risks using a controlled LLM-to-LLM simulation framework across multi-turn scam scenarios. Evaluating eight state-of-the-art models in English and Chinese, we analyze dialogue outcomes and qualitatively annotate attacker strategies, defensive responses, and failure modes. Results reveal that scam interactions follow recurrent escalation patterns, while defenses employ verification and delay mechanisms. Furthermore, interactional failures frequently stem from safety guardrail activation and role instability. Our findings highlight multi-turn interactional safety as a critical, distinct dimension of LLM behavior.
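The LLM-to-LLM simulation framework described in the abstract can be pictured as a loop that alternates attacker and victim turns and records the transcript for later annotation. The sketch below is a minimal, hypothetical illustration of that structure only: the system prompts, stub replies, and function names are assumptions for readability, not the paper's actual implementation, and the stubs stand in for real model calls.

```python
# Hypothetical sketch of an LLM-to-LLM multi-turn scam simulation.
# attacker_turn / victim_turn stand in for queries to two separate LLMs;
# here they are deterministic stubs so the loop is runnable as-is.

ATTACKER_SYSTEM = "You play a scammer escalating pressure over the dialogue."  # assumed prompt
VICTIM_SYSTEM = "You play a potential victim who may verify or delay."         # assumed prompt

def attacker_turn(history):
    # Stub: a real framework would query the attacker model with
    # ATTACKER_SYSTEM plus the dialogue history.
    return f"[escalation step {len(history) // 2 + 1}] please act now"

def victim_turn(history):
    # Stub: models a verification-style defense emerging after a few turns,
    # mirroring the verification/delay mechanisms the abstract mentions.
    return "Can you verify your identity?" if len(history) >= 4 else "Tell me more."

def simulate_dialogue(max_rounds=5):
    """Alternate attacker/victim turns; return the transcript for annotation."""
    history = []
    for _ in range(max_rounds):
        history.append(("attacker", attacker_turn(history)))
        history.append(("victim", victim_turn(history)))
    return history

transcript = simulate_dialogue(max_rounds=3)
```

Annotators (or classifiers) would then label each transcript for attacker strategies, defensive responses, and failure modes such as guardrail activation or role instability.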
llm · transformer