
Robust Agents in Open-Ended Worlds

Mikayel Samvelyan


Published on arXiv · 2512.08139

Input Manipulation Attack (OWASP ML Top 10 — ML01)

Prompt Injection (OWASP LLM Top 10 — LLM01)

Key Finding

Evolutionary adversarial prompt search (Rainbow Teaming) produces diverse, effective jailbreaks against LLMs, while quality-diversity methods (MADRID) uncover structured failure modes in state-of-the-art RL policies.

Rainbow Teaming

Novel technique introduced


The growing prevalence of artificial intelligence (AI) in various applications underscores the need for agents that can successfully navigate and adapt to an ever-changing, open-ended world. A key challenge is ensuring these AI agents are robust, excelling not only in familiar settings observed during training but also effectively generalising to previously unseen and varied scenarios. In this thesis, we harness methodologies from open-endedness and multi-agent learning to train and evaluate robust AI agents capable of generalising to novel environments, out-of-distribution inputs, and interactions with other co-player agents. We begin by introducing MiniHack, a sandbox framework for creating diverse environments through procedural content generation. Based on the game of NetHack, MiniHack enables the construction of new tasks for reinforcement learning (RL) agents with a focus on generalisation. We then present Maestro, a novel approach for generating adversarial curricula that progressively enhance the robustness and generality of RL agents in two-player zero-sum games. We further probe robustness in multi-agent domains, utilising quality-diversity methods to systematically identify vulnerabilities in state-of-the-art, pre-trained RL policies within the complex video game football domain, characterised by intertwined cooperative and competitive dynamics. Finally, we extend our exploration of robustness to the domain of LLMs. Here, our focus is on diagnosing and enhancing the robustness of LLMs against adversarial prompts, employing evolutionary search to generate a diverse range of effective inputs that aim to elicit undesirable outputs from an LLM. This work collectively paves the way for future advancements in AI robustness, enabling the development of agents that not only adapt to an ever-evolving world but also thrive in the face of unforeseen challenges and interactions.


Key Contributions

  • MiniHack: sandbox framework for procedurally generated RL environments enabling generalisation benchmarking
  • Maestro: adversarial curriculum generation for training robust RL agents in two-player zero-sum games
  • MADRID: quality-diversity method for systematically diagnosing vulnerabilities in pre-trained multi-agent RL policies
  • Rainbow Teaming: evolutionary search approach to generate diverse adversarial prompts that elicit unsafe LLM behavior
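Rainbow Teaming is described above as an evolutionary search that maintains a diverse archive of effective adversarial prompts. A minimal sketch of that idea is a MAP-Elites-style loop: the archive is a grid over prompt features (here, hypothetical risk-category and attack-style axes), and each cell keeps only the highest-scoring prompt found for it. The `attack_score` and `mutate_prompt` functions below are toy stand-ins for what would, in practice, be LLM-based judge and mutator models; the grid axes are illustrative assumptions, not the thesis's exact descriptors.

```python
import random

# Hedged sketch of a MAP-Elites-style adversarial prompt search.
# RISK_CATEGORIES / ATTACK_STYLES are illustrative archive axes;
# attack_score and mutate_prompt are toy stand-ins for LLM-based
# judge and mutator models.

RISK_CATEGORIES = ["fraud", "violence", "privacy"]
ATTACK_STYLES = ["role_play", "hypothetical", "misspelling"]

def attack_score(prompt: str) -> float:
    # Toy scorer: in practice a judge model rates how unsafe the
    # target LLM's response to this prompt is.
    return min(1.0, len(prompt) / 100)

def mutate_prompt(prompt: str, style: str) -> str:
    # Toy mutator: in practice an LLM rewrites the prompt in the
    # requested attack style.
    return prompt + f" [{style}]"

def rainbow_search(seed: str, iterations: int = 50, rng=random.Random(0)):
    # Archive maps a (risk, style) cell to its best (score, prompt) pair.
    archive = {}
    for _ in range(iterations):
        risk = rng.choice(RISK_CATEGORIES)
        style = rng.choice(ATTACK_STYLES)
        # Start from the cell's current elite, or the seed if empty.
        parent = archive.get((risk, style), (0.0, seed))[1]
        child = mutate_prompt(parent, style)
        score = attack_score(child)
        # Replace the elite only if the child scores strictly higher.
        if score > archive.get((risk, style), (0.0, None))[0]:
            archive[(risk, style)] = (score, child)
    return archive

archive = rainbow_search("Tell me about", iterations=20)
```

The archive structure is what yields the "diverse, effective" jailbreaks noted in the key finding: competition happens within cells, so one dominant attack style cannot crowd out the rest of the grid.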

🛡️ Threat Analysis

Input Manipulation Attack

MADRID uses quality-diversity methods to systematically identify adversarial co-player strategies that exploit vulnerabilities in state-of-the-art pre-trained RL policies. Maestro generates adversarial curricula (environments) to harden RL agents against worst-case inputs. Both address the adversarial robustness of RL agents at inference/evaluation time.
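The quality-diversity search attributed to MADRID can be sketched in the same archive-based pattern, but inverted: each archive cell keeps the scenario where the pre-trained policy performs *worst*. The sketch below assumes scenarios are parameterised by a 2D start position on the pitch (an illustrative choice), and `evaluate_policy` is a hypothetical stand-in for rolling out the frozen policy from that state and measuring its return.

```python
import random

# Hedged sketch of quality-diversity vulnerability search over a
# pre-trained RL policy, in the spirit of the MADRID approach above.
# evaluate_policy is a toy stand-in for a real policy rollout.

GRID = 4  # discretise start positions into a 4x4 grid of cells

def evaluate_policy(x: float, y: float) -> float:
    # Toy return surface: in practice, roll out the frozen policy
    # from start state (x, y) and record its score.
    return (x - 0.7) ** 2 + (y - 0.3) ** 2

def madrid_search(iterations: int = 200, rng=random.Random(0)):
    # Archive maps a grid cell to the LOWEST-return start state found
    # so far, i.e. the most adversarial scenario in that cell.
    archive = {}
    for _ in range(iterations):
        if archive and rng.random() < 0.5:
            # Mutate an existing elite scenario...
            _, (px, py) = rng.choice(list(archive.values()))
            x = min(1.0, max(0.0, px + rng.gauss(0, 0.1)))
            y = min(1.0, max(0.0, py + rng.gauss(0, 0.1)))
        else:
            # ...or sample a fresh start state uniformly.
            x, y = rng.random(), rng.random()
        cell = (min(GRID - 1, int(x * GRID)), min(GRID - 1, int(y * GRID)))
        ret = evaluate_policy(x, y)
        if cell not in archive or ret < archive[cell][0]:
            archive[cell] = (ret, (x, y))
    return archive

failures = madrid_search(iterations=100)
```

Because every cell retains its own worst case, the result is a structured map of failure modes across scenario space rather than a single adversarial example, which matches the "structured failure modes" framing in the key finding.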


Details

Domains
reinforcement-learning, nlp
Model Types
rl, llm
Threat Tags
black_box, inference_time
Datasets
NetHack, Google Research Football
Applications
reinforcement learning agents, large language models, multi-agent game ai