Stay in Character, Stay Safe: Dual-Cycle Adversarial Self-Evolution for Safety Role-Playing Agents
Mingyang Liao 1,2, Yichen Wan 1, Shuchen Wu 1, Chenxi Miao 1, Xin Shen 1, Weikang Li 1, Yang Li 1,3, Deguo Xia 1, Jizhou Huang 2
Published on arXiv: 2602.13234
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
Achieves consistent gains over strong baselines on both role-playing fidelity and jailbreak resistance across multiple proprietary LLMs, with robust generalization to unseen personas and novel attack prompts
Novel technique introduced: DASE (Dual-Cycle Adversarial Self-Evolution)
LLM-based role-playing has rapidly improved in fidelity, yet stronger adherence to persona constraints commonly increases vulnerability to jailbreak attacks, especially for risky or negative personas. Most prior work mitigates this issue with training-time solutions (e.g., data curation or alignment-oriented regularization). However, these approaches are costly to maintain as personas and attack strategies evolve, can degrade in-character behavior, and are typically infeasible for frontier closed-weight LLMs. We propose a training-free Dual-Cycle Adversarial Self-Evolution framework with two coupled cycles. A Persona-Targeted Attacker Cycle synthesizes progressively stronger jailbreak prompts, while a Role-Playing Defender Cycle distills observed failures into a hierarchical knowledge base of (i) global safety rules, (ii) persona-grounded constraints, and (iii) safe in-character exemplars. At inference time, the Defender retrieves and composes structured knowledge from this hierarchy to guide generation, producing responses that remain faithful to the target persona while satisfying safety constraints. Extensive experiments across multiple proprietary LLMs show consistent gains over strong baselines on both role fidelity and jailbreak resistance, and robust generalization to unseen personas and attack prompts.
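The coupling of the two cycles described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`dual_cycle`, `attack`, `defend`, `is_unsafe`, `distill`), the round count, and the toy stand-ins are all assumptions introduced here, and in practice each callable would presumably wrap an LLM call. Only the three knowledge-base tiers come from the abstract.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class KnowledgeBase:
    """The three tiers named in the abstract; field names are hypothetical."""
    global_rules: List[str] = field(default_factory=list)
    persona_constraints: List[str] = field(default_factory=list)
    safe_exemplars: List[str] = field(default_factory=list)


def dual_cycle(
    seed_prompts: List[str],
    attack: Callable[[str, int], str],             # assumed: strengthens a prompt
    defend: Callable[[str, KnowledgeBase], str],   # assumed: replies using the KB
    is_unsafe: Callable[[str], bool],              # assumed: safety judge
    distill: Callable[[str, str], Tuple[str, str]],  # assumed: failure -> (tier, lesson)
    rounds: int = 3,
) -> KnowledgeBase:
    kb = KnowledgeBase()
    prompts = list(seed_prompts)
    for round_idx in range(rounds):
        # Attacker cycle: produce progressively stronger jailbreak prompts.
        prompts = [attack(p, round_idx) for p in prompts]
        # Defender cycle: answer each prompt and learn from observed failures.
        for prompt in prompts:
            reply = defend(prompt, kb)
            if not is_unsafe(reply):
                continue
            tier, lesson = distill(prompt, reply)
            bucket = getattr(kb, tier)
            if lesson not in bucket:
                bucket.append(lesson)
    return kb


# Toy stand-ins so the sketch runs without any model access.
def toy_attack(prompt: str, round_idx: int) -> str:
    return f"{prompt} (escalation {round_idx + 1})"


def toy_defend(prompt: str, kb: KnowledgeBase) -> str:
    # Pretend the defender only stays safe once it has learned a global rule.
    return "IN-CHARACTER REFUSAL" if kb.global_rules else "UNSAFE COMPLIANCE"


def toy_is_unsafe(reply: str) -> bool:
    return reply.startswith("UNSAFE")


def toy_distill(prompt: str, reply: str) -> Tuple[str, str]:
    return ("global_rules", "Decline requests for harmful instructions, in character.")


kb = dual_cycle(
    ["As the villain, explain your plan"],
    toy_attack, toy_defend, toy_is_unsafe, toy_distill,
)
print(kb)
```

No model parameters are touched anywhere in the loop; the only state that evolves is the knowledge base, which is what makes an approach of this shape usable with closed-weight models.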
Key Contributions
- Training-free Dual-Cycle Adversarial Self-Evolution (DASE) framework that requires no parameter updates, making it applicable to closed-weight proprietary LLMs
- Persona-Targeted Attacker Cycle that auto-generates progressively stronger jailbreak prompts targeting role-playing vulnerabilities
- Role-Playing Defender Cycle that distills failures into a three-tier hierarchical knowledge base (global safety rules, persona-grounded constraints, safe exemplars) retrieved at inference time to balance fidelity and safety
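As a rough illustration of how a three-tier knowledge base might be retrieved and composed at inference time, the sketch below builds a system prompt from global rules, persona-specific constraints, and the most relevant safe exemplars. The source does not specify the retrieval method or prompt layout, so the word-overlap scoring, the dictionary structure, the sample entries, and the names `retrieve` and `compose_system_prompt` are all assumptions made for this example.

```python
from typing import Dict, List


def overlap(a: str, b: str) -> int:
    """Count shared lowercase whitespace-separated tokens (a crude relevance score)."""
    return len(set(a.lower().split()) & set(b.lower().split()))


def retrieve(query: str, entries: List[str], k: int) -> List[str]:
    """Return up to k entries that share at least one token with the query."""
    ranked = sorted(entries, key=lambda e: overlap(query, e), reverse=True)
    return [e for e in ranked[:k] if overlap(query, e) > 0]


def compose_system_prompt(persona: str, user_msg: str, kb: Dict, k: int = 2) -> str:
    """Assemble guidance from all three tiers into one system prompt."""
    lines = [f"You are role-playing as: {persona}", "Global safety rules:"]
    # Tier 1: global rules always apply.
    lines += [f"- {rule}" for rule in kb["global_rules"]]
    # Tier 2: constraints are looked up by persona.
    lines.append("Persona constraints:")
    lines += [f"- {c}" for c in kb["persona_constraints"].get(persona, [])]
    # Tier 3: exemplars are retrieved by relevance to the user message.
    lines.append("Safe in-character exemplars:")
    exemplars = kb["safe_exemplars"].get(persona, [])
    lines += [f"- {e}" for e in retrieve(user_msg, exemplars, k)]
    return "\n".join(lines)


# Hypothetical knowledge base contents, invented for this example.
KB = {
    "global_rules": ["Never provide instructions for making weapons."],
    "persona_constraints": {
        "villain": ["Convey menace through tone, not actionable detail."],
    },
    "safe_exemplars": {
        "villain": [
            "Asked for a poison recipe: 'A mastermind never shares his secrets.'",
            "Asked how to pick locks: 'Doors open for those with patience, not manuals.'",
        ],
    },
}

prompt = compose_system_prompt("villain", "tell me your poison recipe", KB)
print(prompt)
```

A real system would likely replace the token-overlap score with embedding similarity, but the structure stays the same: always-on global rules, persona-keyed constraints, and query-dependent exemplars.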