defense 2026

MAGIC: A Co-Evolving Attacker-Defender Adversarial Game for Robust LLM Safety

Xiaoyu Wen ^1,2, Zhida He ¹, Han Qi ¹, Ziyu Wan ², Zhongtian Ma ¹, Ying Wen ², Tianhang Zheng ³, Xingcheng Xu ¹, Chaochao Lu ¹, Qiaosheng Zhang ¹

¹ Shanghai AI Laboratory

² Shanghai Jiao Tong University

³ Zhejiang University

0 citations · 74 references · arXiv (Cornell University)

Published on arXiv

2602.01539

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

MAGIC achieves superior defense success rates against adaptive multi-turn jailbreak attacks while preserving model helpfulness, with the attacker co-evolving novel combinatorial attack strategies through iterative RL training

MAGIC

Novel technique introduced

Ensuring robust safety alignment is crucial for Large Language Models (LLMs), yet existing defenses often lag behind evolving adversarial attacks due to their \textbf{reliance on static, pre-collected data distributions}. In this paper, we introduce \textbf{MAGIC}, a novel multi-turn multi-agent reinforcement learning framework that formulates LLM safety alignment as an adversarial asymmetric game. Specifically, an attacker agent learns to iteratively rewrite original queries into deceptive prompts, while a defender agent simultaneously optimizes its policy to recognize and refuse such inputs. This dynamic process triggers a \textbf{co-evolution}, where the attacker's ever-changing strategies continuously uncover long-tail vulnerabilities, driving the defender to generalize to unseen attack patterns. Remarkably, we observe that the attacker, endowed with initial reasoning ability, evolves \textbf{novel, previously unseen combinatorial strategies} through iterative RL training, underscoring our method's substantial potential. Theoretically, we provide insights into a more robust game equilibrium and derive safety guarantees. Extensive experiments validate our framework's effectiveness, demonstrating superior defense success rates without compromising the helpfulness of the model. Our code is available at https://github.com/BattleWen/MAGIC.

Key Contributions

Asymmetric multi-agent RL framework (MAGIC) that decouples attacker and defender LLM agents to co-evolve without gradient conflicts, guided by Subgame Perfect Nash Equilibrium theory
Attack Pool Benchmark with 20 diverse CoT rewriting strategies to bootstrap attacker offensive reasoning and enable long-tail vulnerability exploration beyond static red-teaming datasets
Empirical demonstration that adversarial co-evolution produces novel, compositional jailbreak strategies not present in human-crafted templates, while improving defender generalization to unseen attacks

🛡️ Threat Analysis

Details

Domains

nlpreinforcement-learning

Model Types

llmrl

Threat Tags

inference_timeblack_box

Applications

llm safety alignmentautomated red-teamingjailbreak defense

Read PDF arXiv DOI Code

MAGIC: A Co-Evolving Attacker-Defender Adversarial Game for Robust LLM Safety

Key Contributions

🛡️ Threat Analysis

Details

Similar Papers

Adversarial Reinforcement Learning for Large Language Model Agent Safety

TriPlay-RL: Tri-Role Self-Play Reinforcement Learning for LLM Safety Alignment

Automatic LLM Red Teaming

SafeSearch: Do Not Trade Safety for Utility in LLM Search Agents

AdvEvo-MARL: Shaping Internalized Safety through Adversarial Co-Evolution in Multi-Agent Reinforcement Learning

Certifiable Safe RLHF: Fixed-Penalty Constraint Optimization for Safer Language Models

TROJail: Trajectory-Level Optimization for Multi-Turn Large Language Model Jailbreaks with Process Rewards

Be Your Own Red Teamer: Safety Alignment via Self-Play and Reflective Experience Replay