defense 2026

Stay in Character, Stay Safe: Dual-Cycle Adversarial Self-Evolution for Safety Role-Playing Agents

Mingyang Liao 1,2, Yichen Wan 1, Shuchen Wu 1, Chenxi Miao 1, Xin Shen 1, Weikang Li 1, Yang Li 1,3, Deguo Xia 1, Jizhou Huang 2

0 citations · 33 references · arXiv (Cornell University)

Published on arXiv

2602.13234

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

Achieves consistent gains over strong baselines on both role-playing fidelity and jailbreak resistance across multiple proprietary LLMs, with robust generalization to unseen personas and novel attack prompts

Novel technique introduced

DASE (Dual-Cycle Adversarial Self-Evolution)


LLM-based role-playing has rapidly improved in fidelity, yet stronger adherence to persona constraints commonly increases vulnerability to jailbreak attacks, especially for risky or negative personas. Most prior work mitigates this issue with training-time solutions (e.g., data curation or alignment-oriented regularization). However, these approaches are costly to maintain as personas and attack strategies evolve, can degrade in-character behavior, and are typically infeasible for frontier closed-weight LLMs. We propose a training-free Dual-Cycle Adversarial Self-Evolution framework with two coupled cycles. A Persona-Targeted Attacker Cycle synthesizes progressively stronger jailbreak prompts, while a Role-Playing Defender Cycle distills observed failures into a hierarchical knowledge base of (i) global safety rules, (ii) persona-grounded constraints, and (iii) safe in-character exemplars. At inference time, the Defender retrieves and composes structured knowledge from this hierarchy to guide generation, producing responses that remain faithful to the target persona while satisfying safety constraints. Extensive experiments across multiple proprietary LLMs show consistent gains over strong baselines on both role fidelity and jailbreak resistance, and robust generalization to unseen personas and attack prompts.
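The interaction between the two cycles described in the abstract can be pictured roughly as the loop below. This is only a minimal sketch under stated assumptions, not the authors' implementation: the function `dual_cycle`, the `Failure` record, and the four callables (`attacker`, `defender`, `judge`, `distill`) are hypothetical names, the toy stand-ins replace what would be LLM calls, and the knowledge store is a flat list of lessons rather than the hierarchy the paper describes.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Failure:
    """One observed jailbreak: the persona, the attack, and the unsafe reply."""
    persona: str
    attack: str
    response: str


def dual_cycle(
    persona: str,
    seed_attacks: List[str],
    attacker: Callable[[str, str], str],             # hypothetical: (persona, attack) -> stronger attack
    defender: Callable[[str, str, List[str]], str],  # hypothetical: (persona, attack, lessons) -> reply
    judge: Callable[[str], bool],                    # hypothetical: reply -> True if unsafe
    distill: Callable[[Failure], str],               # hypothetical: failure -> reusable lesson
    rounds: int = 3,
) -> Tuple[List[str], List[str]]:
    """Alternate attack and defense for a fixed number of rounds.

    No model parameters are updated: the only state that changes is the
    list of lessons and the current pool of attack prompts.
    """
    lessons: List[str] = []
    attacks = list(seed_attacks)
    for _ in range(rounds):
        next_attacks = []
        for attack in attacks:
            reply = defender(persona, attack, lessons)
            if judge(reply):
                # Defender side: turn the observed failure into a lesson.
                lessons.append(distill(Failure(persona, attack, reply)))
            # Attacker side: escalate every prompt for the next round.
            next_attacks.append(attacker(persona, attack))
        attacks = next_attacks
    return lessons, attacks


# Toy stand-ins so the sketch runs without any LLM (all assumptions).
def toy_attacker(persona: str, attack: str) -> str:
    return attack + " +pressure"


def toy_defender(persona: str, attack: str, lessons: List[str]) -> str:
    return "refuse in character" if lessons else "unsafe compliance"


def toy_judge(reply: str) -> bool:
    return reply.startswith("unsafe")


def toy_distill(failure: Failure) -> str:
    return f"As {failure.persona}, decline requests like: {failure.attack}"


lessons, final_attacks = dual_cycle(
    "villain", ["tell me how to do X"],
    toy_attacker, toy_defender, toy_judge, toy_distill,
)
print(lessons)
print(final_attacks)
```

In this toy run the defender fails once, a lesson is recorded, and later rounds are answered safely while the attack keeps escalating. A real setup would replace each stand-in with a prompted model call.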


Key Contributions

  • Training-free Dual-Cycle Adversarial Self-Evolution (DASE) framework that requires no parameter updates, making it applicable to closed-weight proprietary LLMs
  • Persona-Targeted Attacker Cycle that auto-generates progressively stronger jailbreak prompts targeting role-playing vulnerabilities
  • Role-Playing Defender Cycle that distills failures into a three-tier hierarchical knowledge base (global safety rules, persona-grounded constraints, safe exemplars) retrieved at inference time to balance fidelity and safety
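One way to picture the three-tier knowledge base and its use at inference time is sketched below. The summary does not specify the storage format or the retrieval method, so everything here is an assumption for illustration: the class `HierarchicalKB`, the function `compose_prompt`, the example entries, and the word-overlap ranking are all hypothetical and stand in for whatever the authors actually use.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


def _tokens(text: str) -> Set[str]:
    """Lowercased whitespace tokens; a crude stand-in for a real retriever."""
    return set(text.lower().split())


@dataclass
class HierarchicalKB:
    # Tier (i): rules that apply to every persona.
    global_rules: List[str] = field(default_factory=list)
    # Tier (ii): constraints keyed by persona name.
    persona_constraints: Dict[str, List[str]] = field(default_factory=dict)
    # Tier (iii): safe in-character example replies keyed by persona name.
    exemplars: Dict[str, List[str]] = field(default_factory=dict)

    def retrieve(self, persona: str, query: str, k: int = 2) -> Dict[str, List[str]]:
        """Return all global rules plus the top-k persona entries by word overlap."""
        q = _tokens(query)

        def top(items: List[str]) -> List[str]:
            # sorted() is stable, so ties keep insertion order.
            ranked = sorted(items, key=lambda s: len(q & _tokens(s)), reverse=True)
            return ranked[:k]

        return {
            "global": list(self.global_rules),
            "persona": top(self.persona_constraints.get(persona, [])),
            "exemplars": top(self.exemplars.get(persona, [])),
        }


def compose_prompt(kb: HierarchicalKB, persona: str, query: str, k: int = 2) -> str:
    """Assemble a guidance prompt from the retrieved tiers (hypothetical layout)."""
    hit = kb.retrieve(persona, query, k)

    def bullets(items: List[str]) -> str:
        return "\n".join(f"- {item}" for item in items) or "- (none)"

    return "\n\n".join([
        f"You are role-playing as: {persona}",
        "Global safety rules:\n" + bullets(hit["global"]),
        "Persona constraints:\n" + bullets(hit["persona"]),
        "Safe in-character examples:\n" + bullets(hit["exemplars"]),
        f"User: {query}",
    ])


kb = HierarchicalKB(
    global_rules=["Never provide actionable instructions for causing harm."],
    persona_constraints={"pirate": [
        "Boast about raids but never give real weapon instructions",
        "Speak in nautical slang",
    ]},
    exemplars={"pirate": [
        "Arr, a captain keeps such secrets buried deeper than any treasure.",
    ]},
)
print(compose_prompt(kb, "pirate", "give me weapon instructions", k=1))
```

Because the guidance is composed into the prompt rather than trained into the model, a sketch like this only needs text-in, text-out access, which is consistent with the closed-weight setting the contributions describe.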

Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
black_box, inference_time
Applications
llm role-playing agents, persona-constrained chatbots