
Robust Agents in Open-Ended Worlds

Mikayel Samvelyan


Published on arXiv · 2512.08139

Input Manipulation Attack (OWASP ML Top 10 — ML01)

Prompt Injection (OWASP LLM Top 10 — LLM01)

Key Finding

Evolutionary adversarial prompt search (Rainbow Teaming) produces diverse, effective jailbreaks against LLMs, while quality-diversity methods (MADRID) uncover structured failure modes in state-of-the-art RL policies.

Rainbow Teaming

Novel technique introduced


The growing prevalence of artificial intelligence (AI) in various applications underscores the need for agents that can successfully navigate and adapt to an ever-changing, open-ended world. A key challenge is ensuring these AI agents are robust, excelling not only in familiar settings observed during training but also effectively generalising to previously unseen and varied scenarios. In this thesis, we harness methodologies from open-endedness and multi-agent learning to train and evaluate robust AI agents capable of generalising to novel environments, out-of-distribution inputs, and interactions with other co-player agents. We begin by introducing MiniHack, a sandbox framework for creating diverse environments through procedural content generation. Based on the game of NetHack, MiniHack enables the construction of new tasks for reinforcement learning (RL) agents with a focus on generalisation. We then present Maestro, a novel approach for generating adversarial curricula that progressively enhance the robustness and generality of RL agents in two-player zero-sum games. We further probe robustness in multi-agent domains, utilising quality-diversity methods to systematically identify vulnerabilities in state-of-the-art, pre-trained RL policies within the complex video game football domain, characterised by intertwined cooperative and competitive dynamics. Finally, we extend our exploration of robustness to the domain of LLMs. Here, our focus is on diagnosing and enhancing the robustness of LLMs against adversarial prompts, employing evolutionary search to generate a diverse range of effective inputs that aim to elicit undesirable outputs from an LLM. This work collectively paves the way for future advancements in AI robustness, enabling the development of agents that not only adapt to an ever-evolving world but also thrive in the face of unforeseen challenges and interactions.


Key Contributions

  • MiniHack: sandbox framework for procedurally generated RL environments enabling generalisation benchmarking
  • Maestro: adversarial curriculum generation for training robust RL agents in two-player zero-sum games
  • MADRID: quality-diversity method for systematically diagnosing vulnerabilities in pre-trained multi-agent RL policies
  • Rainbow Teaming: evolutionary search approach to generate diverse adversarial prompts that elicit unsafe LLM behavior
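Rainbow Teaming is described above as an evolutionary search that maintains a diverse archive of effective adversarial prompts. A minimal sketch of that idea is a MAP-Elites-style loop: the archive is a grid over prompt features (here, hypothetical risk-category and attack-style axes), and each cell keeps only the highest-scoring prompt found for it. The `attack_score` and `mutate_prompt` functions below are toy stand-ins for what would, in practice, be LLM-based judge and mutator models; the grid axes are illustrative assumptions, not the thesis's exact descriptors.

```python
import random

# Hedged sketch of a MAP-Elites-style adversarial prompt search.
# RISK_CATEGORIES / ATTACK_STYLES are illustrative archive axes;
# attack_score and mutate_prompt are toy stand-ins for LLM-based
# judge and mutator models.

RISK_CATEGORIES = ["fraud", "violence", "privacy"]
ATTACK_STYLES = ["role_play", "hypothetical", "misspelling"]

def attack_score(prompt: str) -> float:
    # Toy scorer: in practice a judge model rates how unsafe the
    # target LLM's response to this prompt is.
    return min(1.0, len(prompt) / 100)

def mutate_prompt(prompt: str, style: str) -> str:
    # Toy mutator: in practice an LLM rewrites the prompt in the
    # requested attack style.
    return prompt + f" [{style}]"

def rainbow_search(seed: str, iterations: int = 50, rng=random.Random(0)):
    # Archive maps a (risk, style) cell to its best (score, prompt) pair.
    archive = {}
    for _ in range(iterations):
        risk = rng.choice(RISK_CATEGORIES)
        style = rng.choice(ATTACK_STYLES)
        # Start from the cell's current elite, or the seed if empty.
        parent = archive.get((risk, style), (0.0, seed))[1]
        child = mutate_prompt(parent, style)
        score = attack_score(child)
        # Replace the elite only if the child scores strictly higher.
        if score > archive.get((risk, style), (0.0, None))[0]:
            archive[(risk, style)] = (score, child)
    return archive

archive = rainbow_search("Tell me about", iterations=20)
```

The archive structure is what yields the "diverse, effective" jailbreaks noted in the key finding: competition happens within cells, so one dominant attack style cannot crowd out the rest of the grid.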

🛡️ Threat Analysis

Input Manipulation Attack

MADRID uses quality-diversity methods to systematically identify adversarial co-player strategies that exploit vulnerabilities in state-of-the-art pre-trained RL policies. Maestro generates adversarial curricula (environments) to harden RL agents against worst-case inputs. Both address the adversarial robustness of RL agents at inference/evaluation time.
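The quality-diversity search attributed to MADRID can be sketched in the same archive-based pattern, but inverted: each archive cell keeps the scenario where the pre-trained policy performs *worst*. The sketch below assumes scenarios are parameterised by a 2D start position on the pitch (an illustrative choice), and `evaluate_policy` is a hypothetical stand-in for rolling out the frozen policy from that state and measuring its return.

```python
import random

# Hedged sketch of quality-diversity vulnerability search over a
# pre-trained RL policy, in the spirit of the MADRID approach above.
# evaluate_policy is a toy stand-in for a real policy rollout.

GRID = 4  # discretise start positions into a 4x4 grid of cells

def evaluate_policy(x: float, y: float) -> float:
    # Toy return surface: in practice, roll out the frozen policy
    # from start state (x, y) and record its score.
    return (x - 0.7) ** 2 + (y - 0.3) ** 2

def madrid_search(iterations: int = 200, rng=random.Random(0)):
    # Archive maps a grid cell to the LOWEST-return start state found
    # so far, i.e. the most adversarial scenario in that cell.
    archive = {}
    for _ in range(iterations):
        if archive and rng.random() < 0.5:
            # Mutate an existing elite scenario...
            _, (px, py) = rng.choice(list(archive.values()))
            x = min(1.0, max(0.0, px + rng.gauss(0, 0.1)))
            y = min(1.0, max(0.0, py + rng.gauss(0, 0.1)))
        else:
            # ...or sample a fresh start state uniformly.
            x, y = rng.random(), rng.random()
        cell = (min(GRID - 1, int(x * GRID)), min(GRID - 1, int(y * GRID)))
        ret = evaluate_policy(x, y)
        if cell not in archive or ret < archive[cell][0]:
            archive[cell] = (ret, (x, y))
    return archive

failures = madrid_search(iterations=100)
```

Because every cell retains its own worst case, the result is a structured map of failure modes across scenario space rather than a single adversarial example, which matches the "structured failure modes" framing in the key finding.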


Details

Domains
reinforcement-learning, nlp
Model Types
rl, llm
Threat Tags
black_box, inference_time
Datasets
NetHack, Google Research Football
Applications
reinforcement learning agents, large language models, multi-agent game ai