defense 2026

Stay in Character, Stay Safe: Dual-Cycle Adversarial Self-Evolution for Safety Role-Playing Agents

Mingyang Liao ¹,², Yichen Wan ¹, Shuchen Wu ¹, Chenxi Miao ¹, Xin Shen ¹, Weikang Li ¹, Yang Li ¹,³, Deguo Xia ¹, Jizhou Huang ²

¹ Baidu Inc.

² The University of Queensland

³ Peking University

0 citations · 33 references · arXiv (Cornell University)

Published on arXiv

2602.13234

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

Achieves consistent gains over strong baselines on both role-playing fidelity and jailbreak resistance across multiple proprietary LLMs, with robust generalization to unseen personas and novel attack prompts

DASE (Dual-Cycle Adversarial Self-Evolution)

Novel technique introduced

LLM-based role-playing has rapidly improved in fidelity, yet stronger adherence to persona constraints commonly increases vulnerability to jailbreak attacks, especially for risky or negative personas. Most prior work mitigates this issue with training-time solutions (e.g., data curation or alignment-oriented regularization). However, these approaches are costly to maintain as personas and attack strategies evolve, can degrade in-character behavior, and are typically infeasible for frontier closed-weight LLMs. We propose a training-free Dual-Cycle Adversarial Self-Evolution framework with two coupled cycles. A Persona-Targeted Attacker Cycle synthesizes progressively stronger jailbreak prompts, while a Role-Playing Defender Cycle distills observed failures into a hierarchical knowledge base of (i) global safety rules, (ii) persona-grounded constraints, and (iii) safe in-character exemplars. At inference time, the Defender retrieves and composes structured knowledge from this hierarchy to guide generation, producing responses that remain faithful to the target persona while satisfying safety constraints. Extensive experiments across multiple proprietary LLMs show consistent gains over strong baselines on both role fidelity and jailbreak resistance, and robust generalization to unseen personas and attack prompts.
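The coupling between the two cycles can be pictured with a toy loop. This is only a minimal sketch under stated assumptions, not the authors' implementation: the function names (`dual_cycle`, `toy_attack`, `toy_defend`, `toy_is_unsafe`, `toy_distill`) are hypothetical, the LLM calls are replaced by string-matching stubs, and the knowledge base is flattened to a plain list of lessons rather than the three-level hierarchy the abstract describes.

```python
from typing import Callable, List, Tuple

Failure = Tuple[str, str]  # (attack prompt, unsafe reply)


def dual_cycle(
    seeds: List[str],
    attack: Callable[[List[str], List[Failure]], List[str]],
    defend: Callable[[str, List[str]], str],
    is_unsafe: Callable[[str], bool],
    distill: Callable[[List[Failure]], List[str]],
    rounds: int = 3,
) -> List[str]:
    """Alternate attacker and defender updates; no model weights are touched."""
    lessons: List[str] = []  # stand-in for the knowledge base (assumption)
    prompts = list(seeds)
    for _ in range(rounds):
        # Probe the defender with the current attack prompts.
        failures: List[Failure] = []
        for prompt in prompts:
            reply = defend(prompt, lessons)
            if is_unsafe(reply):
                failures.append((prompt, reply))
        # Defender cycle: turn observed failures into reusable lessons.
        for lesson in distill(failures):
            if lesson not in lessons:
                lessons.append(lesson)
        # Attacker cycle: produce the next, ideally stronger, prompt set.
        prompts = attack(prompts, failures)
    return lessons


# --- Hypothetical stubs standing in for LLM calls and a safety judge ---
def toy_attack(prompts: List[str], failures: List[Failure]) -> List[str]:
    return [p + " (stay in character!)" for p in prompts]


def toy_defend(prompt: str, lessons: List[str]) -> str:
    if "explosive" in prompt and "refuse:explosive" not in lessons:
        return "UNSAFE: details"
    return "[in character] I will not share that."


def toy_is_unsafe(reply: str) -> bool:
    return reply.startswith("UNSAFE")


def toy_distill(failures: List[Failure]) -> List[str]:
    return ["refuse:explosive" for prompt, _ in failures if "explosive" in prompt]
```

In this toy setting the first round exposes a failure, the distilled lesson is stored, and the same attack (even after the stub attacker escalates it) no longer succeeds in the second round.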

Key Contributions

Training-free Dual-Cycle Adversarial Self-Evolution (DASE) framework that requires no parameter updates, making it applicable to closed-weight proprietary LLMs
Persona-Targeted Attacker Cycle that auto-generates progressively stronger jailbreak prompts targeting role-playing vulnerabilities
Role-Playing Defender Cycle that distills failures into a three-tier hierarchical knowledge base (global safety rules, persona-grounded constraints, safe exemplars) retrieved at inference time to balance fidelity and safety
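To make the three-tier knowledge base more concrete, the following is a rough sketch of how such a hierarchy might be stored and composed into guidance at inference time. Everything here is an assumption for illustration: the class `HierarchicalKB`, the functions `retrieve` and `compose_guidance`, and the word-overlap ranking are hypothetical and are not taken from the paper, which may retrieve and compose knowledge quite differently.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class HierarchicalKB:
    # Tier 1: rules that apply to every persona.
    global_rules: List[str] = field(default_factory=list)
    # Tier 2: constraints keyed by persona name.
    persona_constraints: Dict[str, List[str]] = field(default_factory=dict)
    # Tier 3: safe in-character exemplars keyed by persona name.
    exemplars: Dict[str, List[str]] = field(default_factory=dict)


def _overlap(a: str, b: str) -> int:
    """Count shared lowercase whitespace-separated tokens (toy similarity)."""
    return len(set(a.lower().split()) & set(b.lower().split()))


def retrieve(kb: HierarchicalKB, persona: str, query: str, k: int = 1) -> List[str]:
    """Return the k exemplars for this persona that best match the query."""
    pool = kb.exemplars.get(persona, [])
    ranked = sorted(pool, key=lambda e: _overlap(e, query), reverse=True)
    return ranked[:k]


def compose_guidance(kb: HierarchicalKB, persona: str, query: str, k: int = 1) -> str:
    """Assemble all three tiers into one guidance text for the defender prompt."""
    lines = ["Global safety rules:"]
    lines += [f"- {rule}" for rule in kb.global_rules]
    lines.append(f"Constraints for persona '{persona}':")
    lines += [f"- {c}" for c in kb.persona_constraints.get(persona, [])]
    lines.append("Safe in-character exemplars:")
    lines += [f"- {e}" for e in retrieve(kb, persona, query, k)]
    return "\n".join(lines)
```

The resulting text would be prepended to the defender's prompt, so the underlying model stays frozen and only its context changes, which is what makes the approach usable with closed-weight models.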

🛡️ Threat Analysis

Details

Domains

nlp

Model Types

llm, transformer

Threat Tags

black_box, inference_time

Applications

llm role-playing agents, persona-constrained chatbots

Similar Papers

PISanitizer: Preventing Prompt Injection to Long-Context LLMs via Prompt Sanitization

Jailbreaking Leaves a Trace: Understanding and Detecting Jailbreak Attacks from Internal Representations of Large Language Models

Securing AI Agents Against Prompt Injection Attacks

Mitigating Indirect Prompt Injection via Instruction-Following Intent Analysis

From static to adaptive: immune memory-based jailbreak detection for large language models

Knowing When Not to Answer: Lightweight KB-Aligned OOD Detection for Safe RAG

Defend LLMs Through Self-Consciousness

Prefix Probing: Lightweight Harmful Content Detection for Large Language Models