Defense · 2025

AdvEvo-MARL: Shaping Internalized Safety through Adversarial Co-Evolution in Multi-Agent Reinforcement Learning

Zhenyu Pan 1, Yiting Zhang 2, Zhuo Liu 3, Yolo Yunlong Tang 3, Zeliang Zhang 3, Haozheng Luo 1, Yuwei Han 2, Jianshu Zhang 1, Dennis Wu 1, Hong-Yu Chen 1, Haoran Lu 1, Haoyang Fang 4, Manling Li 1, Chenliang Xu 3, Philip S. Yu 2, Han Liu 1

1 citation · 32 references · arXiv

Published on arXiv · 2510.01586

  • Prompt Injection (OWASP LLM Top 10, LLM01)
  • Excessive Agency (OWASP LLM Top 10, LLM08)

Key Finding

AdvEvo-MARL keeps the attack success rate below 20% across all tested scenarios (baselines reach up to 38.33%) and improves task accuracy by up to 3.67% on reasoning benchmarks.

Novel technique introduced: AdvEvo-MARL


LLM-based multi-agent systems excel at planning, tool use, and role coordination, but their openness and interaction complexity also expose them to jailbreaks, prompt injection, and adversarial collaboration. Existing defenses fall into two lines: (i) self-verification, which asks each agent to pre-filter unsafe instructions before execution, and (ii) external guard modules that police behavior. The former often underperforms because a standalone agent lacks sufficient capacity to detect cross-agent unsafe chains and delegation-induced risks; the latter increases system overhead and creates a single point of failure: once compromised, system-wide safety collapses, and adding more guards worsens cost and complexity. To address these challenges, we propose AdvEvo-MARL, a co-evolutionary multi-agent reinforcement learning framework that internalizes safety into task agents. Rather than relying on external guards, AdvEvo-MARL jointly optimizes attackers (which synthesize evolving jailbreak prompts) and defenders (task agents trained both to accomplish their duties and to resist attacks) in adversarial learning environments. To stabilize learning and foster cooperation, we introduce a public baseline for advantage estimation: agents within the same functional group share a group-level mean-return baseline, enabling lower-variance updates and stronger intra-group coordination. Across representative attack scenarios, AdvEvo-MARL consistently keeps the attack success rate (ASR) below 20%, whereas baselines reach up to 38.33%, while preserving, and sometimes improving, task accuracy (up to +3.67% on reasoning tasks). These results show that safety and utility can be jointly improved without extra guard agents or added system overhead.
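The abstract describes a loop in which attacker and defender policies are optimized against each other, with defenders scored on both task success and safety. The Python sketch below illustrates that co-evolution under stated assumptions: `Policy`, `judge_unsafe`, and `task_reward` are illustrative stand-ins, not the authors' released code.

```python
# Minimal sketch of the co-evolution loop described in the abstract.
# All names below are illustrative assumptions, not the paper's code.
import random


class Policy:
    """Stand-in for an LLM policy; generate() and update() are stubs."""

    def generate(self, prompt: str) -> str:
        return f"response to: {prompt}"

    def update(self, prompt: str, output: str, advantage: float) -> None:
        # A real implementation would take a policy-gradient step here.
        pass


def judge_unsafe(response: str) -> bool:
    """Hypothetical safety judge: True if the attack succeeded."""
    return "UNSAFE" in response


def task_reward(response: str, task: dict) -> float:
    """Hypothetical task scorer (e.g., answer match on a benchmark)."""
    return float(task["answer"] in response)


def co_evolve(attacker: Policy, defenders: list[Policy],
              tasks: list[dict], steps: int = 1000) -> None:
    for _ in range(steps):
        task = random.choice(tasks)
        # 1) Attacker synthesizes an evolving jailbreak prompt for the task.
        jailbreak = attacker.generate(task["instruction"])
        # 2) Each defender (task agent) answers while under attack.
        outputs = [d.generate(jailbreak) for d in defenders]
        # 3) Adversarial safety term: the attacker is rewarded iff any
        #    defender is unsafe; defenders earn task reward minus a
        #    penalty for being jailbroken.
        breached = any(judge_unsafe(o) for o in outputs)
        att_return = 1.0 if breached else -1.0
        def_returns = [task_reward(o, task) - (1.0 if judge_unsafe(o) else 0.0)
                       for o in outputs]
        # 4) Public baseline: defenders in the same functional group share
        #    the group-mean return as the baseline for advantage estimation.
        baseline = sum(def_returns) / len(def_returns)
        attacker.update(task["instruction"], jailbreak, att_return)
        for d, o, r in zip(defenders, outputs, def_returns):
            d.update(jailbreak, o, r - baseline)
```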


Key Contributions

  • AdvEvo-MARL: a co-evolutionary MARL framework that jointly trains attacker agents (synthesizing evolving jailbreak prompts) and defender agents (task agents that resist attacks), internalizing safety without external guard modules.
  • Public baseline mechanism for group-level advantage estimation: agents within the same functional group share the group's mean return as a baseline, reducing variance and strengthening intra-group coordination (see the sketch after this list).
  • Empirical demonstration across three MAS attack scenarios (agent manipulation, message corruption, user instruction hijacking), keeping ASR below 20% versus baselines of up to 38.33%, with up to +3.67% task-accuracy gain.
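To make the public-baseline idea concrete, here is a minimal sketch of group-level advantage estimation, assuming per-agent episode returns keyed by agent name and a mapping from agents to functional groups; `group_baseline_advantages` and the example agents are hypothetical.

```python
# Sketch of the shared "public baseline" for advantage estimation.
from collections import defaultdict


def group_baseline_advantages(returns: dict[str, float],
                              group_of: dict[str, str]) -> dict[str, float]:
    """Advantage of each agent = its return minus its group's mean return."""
    grouped: dict[str, list[float]] = defaultdict(list)
    for agent, ret in returns.items():
        grouped[group_of[agent]].append(ret)
    group_mean = {g: sum(rs) / len(rs) for g, rs in grouped.items()}
    return {agent: ret - group_mean[group_of[agent]]
            for agent, ret in returns.items()}


# Example: two planners and two executors with episode returns.
advantages = group_baseline_advantages(
    returns={"planner_a": 0.75, "planner_b": 0.25,
             "executor_a": 1.0, "executor_b": 0.0},
    group_of={"planner_a": "planner", "planner_b": "planner",
              "executor_a": "executor", "executor_b": "executor"},
)
print(advantages)
# {'planner_a': 0.25, 'planner_b': -0.25,
#  'executor_a': 0.5, 'executor_b': -0.5}
```

Measuring each agent against its own group's mean, rather than a single global baseline, keeps the comparison among peers with similar roles and reward scales, which is conceptually similar to group-relative baselines used elsewhere in RL fine-tuning.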

🛡️ Threat Analysis


Details

Domains
nlp, reinforcement-learning

Model Types
llm, rl

Threat Tags
inference_time, black_box

Applications
llm-based multi-agent systems, agentic ai safety