About
I am a postdoctoral researcher at Microsoft Research in Cambridge, UK, part of the Game Intelligence group. My research focuses on efficient learning for decision-making: how agents can learn from limited experience, whether offline data or online interactions, to act and solve complex tasks. At MSR, I use modern video games as a testbed of complex decision-making tasks with high-fidelity visuals to study fundamental questions about decision-making agents, and to develop algorithms that can learn efficiently in these environments.
Before joining MSR, I completed my PhD and MSc at the University of Edinburgh, supervised by Stefano Albrecht and Amos Storkey. My PhD research studied efficient exploration strategies for deep reinforcement learning agents in both single-agent and multi-agent settings.
Together with Stefano and Filippos Christianos, I co-authored the first introductory textbook on multi-agent reinforcement learning, published by MIT Press in 2024 and freely available at www.marl-book.com. The book has been downloaded over 50,000 times and has been adopted for teaching university courses.
Outside of research, I decompress by going for runs, reading, and playing tabletop roleplaying and board games with friends.
News
- Feb 2026 Talk Invited talk at Imperial College London for the I-X Seminar series.
- Feb 2026 Blog MSR blog post on Rethinking Imitation Learning with Predictive Inverse Dynamics Models. We show that predictive inverse dynamics models (PIDM) represent a sample-efficient alternative to behaviour cloning for offline imitation learning.
- Jan 2026 Talk Invited talk at the Cohere Labs Seminar.
- Dec 2025 Talk Invited talk at the Interactive Learning and Interventional Representations Workshop at the EurIPS conference in Copenhagen.
- Jul 2025 Talk Invited talk on ‘Decision Making in Video Games’ at the BeNeRL 2025 Workshop in Eindhoven.
- Jun 2025 Award Very happy to receive a best reviewer award at ICML 2025!
- Apr 2025 Blog MSR blog on WHAMM! Real-Time World Modelling of Interactive Environments. We trained a world model to simulate video games in real-time, achieving 10+ fps at 640×360 resolution.
- Apr 2025 Paper Presenting Visual Encoders for Data-Efficient Imitation Learning in Modern Video Games at the Adaptive and Learning Agents (ALA) Workshop @ AAMAS 2025.
- Mar 2025 Event Wrapped up co-organising the UK Multi-Agent Systems Symposium 2025 with ~200 participants at the Alan Turing Institute.
- Dec 2024 Paper Ensemble Value Functions for Efficient Exploration in Multi-Agent Reinforcement Learning accepted as oral presentation at AAMAS 2025.
Selected Publications
- Multi-Agent Reinforcement Learning: Foundations and Modern Approaches · MIT Press · Textbook · 2024 · 50,000+ downloads
  This book provides a comprehensive introduction to multi-agent reinforcement learning (MARL), a rapidly growing field that combines insights from game theory, reinforcement learning, and multi-agent systems. The book covers the foundations of MARL, including the key concepts, algorithms, and challenges, and presents a detailed overview of contemporary approaches in deep MARL research.
- When does Predictive Inverse Dynamics Outperform Behavior Cloning? · arXiv preprint · 2026
  Behavior cloning (BC) is a practical offline imitation learning method, but it often fails when expert demonstrations are limited. Recent works have introduced a class of architectures named predictive inverse dynamics models (PIDM) that combine a future state predictor with an inverse dynamics model (IDM). While PIDM often outperforms BC, the reasons behind its benefits remain unclear. In this paper, we provide a theoretical explanation: PIDM introduces a bias-variance tradeoff. While predicting the future state introduces bias, conditioning the IDM on the prediction can significantly reduce variance. We establish conditions on the state predictor bias for PIDM to achieve lower prediction error and higher sample efficiency than BC, with the gap widening when additional data sources are available. We validate the theoretical insights empirically in 2D navigation tasks, where BC requires up to five times (three times on average) more demonstrations than PIDM to reach comparable performance; and in a complex 3D environment in a modern video game with high-dimensional visual inputs and stochastic transitions, where BC requires over 66% more samples than PIDM.
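The PIDM factorisation described in the abstract can be sketched in a few lines; this is a toy illustration with placeholder dynamics and hypothetical function names, not the paper's implementation:

```python
# Toy sketch of the PIDM factorisation: rather than mapping states
# directly to actions (behaviour cloning), PIDM first predicts the next
# state, then infers the action that reaches it. All functions here are
# illustrative placeholders.
import numpy as np

def state_predictor(s):
    """Hypothetical future-state predictor: s_t -> estimate of s_{t+1}."""
    return s + 0.1  # placeholder learned dynamics (biased)

def inverse_dynamics(s, s_next):
    """Hypothetical inverse dynamics model: (s_t, s_{t+1}) -> action."""
    return np.sign(s_next - s)  # placeholder: move towards the target state

def pidm_policy(s):
    """PIDM action selection: predict the future state, then invert it."""
    s_next_hat = state_predictor(s)          # biased state prediction
    return inverse_dynamics(s, s_next_hat)   # variance-reducing action inference

action = pidm_policy(np.zeros(2))
```

Conditioning the IDM on a predicted (hence biased) next state is exactly what creates the bias-variance tradeoff analysed in the paper.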
- Visual Encoders for Data-Efficient Imitation Learning in Modern Video Games · Adaptive and Learning Agents (ALA) Workshop @ AAMAS · 2025
  Video games have served as useful benchmarks for the decision-making community, but going beyond Atari games towards modern games has been prohibitively expensive for the vast majority of the research community. Prior work in modern video games typically relied on game-specific integration to obtain game features and enable online training, or on existing large datasets. An alternative approach is to train agents using imitation learning to play video games purely from images. However, this setting poses a fundamental question: which visual encoders obtain representations that retain information critical for decision making? To answer this question, we conduct a systematic study of imitation learning with publicly available pre-trained visual encoders compared to the typical task-specific end-to-end training approach in Minecraft, Counter-Strike: Global Offensive, and Minecraft Dungeons. Our results show that end-to-end training can be effective with comparably low-resolution images and only minutes of demonstrations, but significant improvements can be gained by utilising pre-trained encoders such as DINOv2 depending on the game. In addition to enabling effective decision making, we show that pre-trained encoders can make decision-making research in video games more accessible by significantly reducing the cost of training.
- Ensemble Value Functions for Efficient Exploration in Multi-Agent Reinforcement Learning · International Conference on Autonomous Agents and Multiagent Systems · AAMAS · 2025 · Oral presentation
  Multi-agent reinforcement learning (MARL) requires agents to explore a vast joint action space to find joint actions that lead to coordination. Existing value-based MARL algorithms commonly rely on random exploration, such as epsilon-greedy, which is neither systematic nor efficient at identifying effective actions in multi-agent problems. Additionally, the concurrent training of the policies of multiple agents can render the optimisation non-stationary, leading to unstable value estimates, high-variance gradients, and, ultimately, hindered coordination between agents. To address these challenges, we propose ensemble value functions for multi-agent exploration (EMAX), a framework that seamlessly extends value-based MARL algorithms. EMAX leverages an ensemble of value functions for each agent to guide their exploration, reduce the variance of their optimisation, and make their policies more robust to miscoordination. It achieves these benefits by (1) systematically guiding the exploration of agents with a UCB policy towards parts of the environment that require multiple agents to coordinate; (2) computing average value estimates across the ensemble as target values to reduce the variance of gradients and stabilise optimisation; and (3) selecting actions during evaluation by majority vote across the ensemble to reduce the likelihood of miscoordination. We first instantiate independent DQN with EMAX and evaluate it in 11 general-sum tasks with sparse rewards, where EMAX improves final evaluation returns by 185% across all tasks. We then evaluate EMAX on top of IDQN, VDN, and QMIX in 21 common-reward tasks, and show that EMAX improves sample efficiency and final evaluation returns across all tasks over all three vanilla algorithms by 60%, 47%, and 538%, respectively.
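The three EMAX mechanisms over an agent's ensemble of Q-value estimates can be sketched as follows; a minimal illustration under assumed shapes (an array of per-member Q-values) with an exploration coefficient `c` as a free parameter, not the paper's full implementation:

```python
import numpy as np

def ucb_action(q_ensemble, c=1.0):
    """UCB exploration over an ensemble of Q-values.

    q_ensemble: array of shape (K, num_actions), one row per ensemble member.
    Picks the action maximising mean + c * std, so exploration is directed
    towards actions the ensemble is uncertain about.
    """
    mean = q_ensemble.mean(axis=0)
    std = q_ensemble.std(axis=0)
    return int(np.argmax(mean + c * std))

def target_value(q_ensemble_next):
    """Bootstrap target: greedy value of the ensemble-averaged Q-values,
    which reduces the variance of the resulting gradients."""
    return float(q_ensemble_next.mean(axis=0).max())

def majority_vote_action(q_ensemble):
    """Evaluation-time action: the greedy choice of the most ensemble
    members, lowering the chance of miscoordination."""
    votes = np.argmax(q_ensemble, axis=1)
    return int(np.bincount(votes).argmax())
```

For example, with `q = np.array([[1.0, 2.0], [1.0, 0.0]])` the members agree on action 0's value but disagree on action 1, so the UCB bonus steers exploration towards action 1.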
- Decoupled Reinforcement Learning to Stabilise Intrinsically-Motivated Exploration · International Conference on Autonomous Agents and Multiagent Systems · AAMAS · 2022
  Intrinsic rewards can improve exploration in reinforcement learning, but the exploration process may suffer from instability caused by non-stationary reward shaping and strong dependency on hyperparameters. In this work, we introduce Decoupled RL (DeRL) as a general framework which trains separate policies for intrinsically-motivated exploration and exploitation. Such decoupling allows DeRL to leverage the benefits of intrinsic rewards for exploration while demonstrating improved robustness and sample efficiency. We evaluate DeRL algorithms in two sparse-reward environments with multiple types of intrinsic rewards. Our results show that DeRL is more robust to varying scale and rate of decay of intrinsic rewards and converges to the same evaluation returns as intrinsically-motivated baselines in fewer interactions. Lastly, we discuss the challenge of distribution shift and show that divergence constraint regularisers can successfully minimise instability caused by divergence of exploration and exploitation policies.
- Benchmarking Multi-Agent Deep Reinforcement Learning Algorithms in Cooperative Tasks · Conference on Neural Information Processing Systems · NeurIPS Datasets & Benchmarks · 2021 · EPyMARL codebase: 700+ GitHub stars
  Multi-agent deep reinforcement learning (MARL) suffers from a lack of commonly-used evaluation tasks and criteria, making comparisons between approaches difficult. In this work, we provide a systematic evaluation and comparison of three different classes of MARL algorithms (independent learning, centralised multi-agent policy gradient, value decomposition) in a diverse range of cooperative multi-agent learning tasks. Our experiments serve as a reference for the expected performance of algorithms across different learning tasks, and we provide insights regarding the effectiveness of different learning approaches. We open-source EPyMARL, which extends the PyMARL codebase to include additional algorithms and allow for flexible configuration of algorithm implementation details such as parameter sharing. Finally, we open-source two environments for multi-agent research which focus on coordination under sparse rewards.
Journal & Conference Papers
- Using Offline Data to Speed-up Reinforcement Learning in Procedurally Generated Environments · Neurocomputing · 2024
  One of the key challenges of Reinforcement Learning (RL) is the ability of agents to generalise their learned policy to unseen settings. Moreover, training RL agents requires large numbers of interactions with the environment. Motivated by the recent success of Offline RL and Imitation Learning (IL), we conduct a study to investigate whether agents can leverage offline data in the form of trajectories to improve sample efficiency in procedurally generated environments. We consider two settings of using IL from offline data for RL: (1) pre-training a policy before online RL training and (2) concurrently training a policy with online RL and IL from offline data. We analyse the impact of the quality (optimality of trajectories) and diversity (number of trajectories and covered levels) of available offline trajectories on the effectiveness of both approaches. Across four well-known sparse-reward tasks in the MiniGrid environment, we find that using IL both for pre-training and concurrently during online RL training consistently improves sample efficiency while converging to optimal policies. Furthermore, we show that pre-training a policy from as few as two trajectories can make the difference between learning an optimal policy at the end of online training and not learning at all. Our findings motivate the widespread adoption of IL for pre-training and concurrent IL in procedurally generated environments whenever offline trajectories are available or can be generated.
- Scalable Multi-Agent Reinforcement Learning for Warehouse Logistics with Robotic and Human Co-Workers · International Conference on Intelligent Robots and Systems · IROS · 2024
  We consider a warehouse in which dozens of mobile robots and human pickers work together to collect and deliver items within the warehouse. The fundamental problem we tackle, called the order-picking problem, is how these worker agents must coordinate their movement and actions in the warehouse to maximise performance in this task. Established industry methods using heuristic approaches require large engineering efforts to optimise for innately variable warehouse configurations. In contrast, multi-agent reinforcement learning (MARL) can be flexibly applied to diverse warehouse configurations (e.g. size, layout, number/types of workers, item replenishment frequency), and different types of order-picking paradigms (e.g. Goods-to-Person and Person-to-Goods), as the agents can learn how to cooperate optimally through experience. We develop hierarchical MARL algorithms in which a manager agent assigns goals to worker agents, and the policies of the manager and workers are co-trained toward maximising a global objective (e.g. pick rate). Our hierarchical algorithms achieve significant gains in sample efficiency over baseline MARL algorithms and overall pick rates over multiple established industry heuristics in a diverse set of warehouse configurations and different order-picking paradigms.
- Multi-Horizon Representations with Hierarchical Forward Models for Reinforcement Learning · Transactions on Machine Learning Research · TMLR · 2023
  Learning control from pixels is difficult for reinforcement learning (RL) agents because representation learning and policy learning are intertwined. Previous approaches remedy this issue with auxiliary representation learning tasks, but they either do not consider the temporal aspect of the problem or only consider single-step transitions, which may cause learning inefficiencies if important environmental changes take many steps to manifest. We propose Hierarchical k-Step Latent (HKSL), an auxiliary task that learns multiple representations via a hierarchy of forward models that learn to communicate and an ensemble of n-step critics that all operate at varying magnitudes of step skipping. We evaluate HKSL in a suite of 30 robotic control tasks with and without distractors and a task of our creation. We find that HKSL either converges to higher or optimal episodic returns more quickly than several alternative representation learning approaches. Furthermore, we find that HKSL's representations capture task-relevant details accurately across timescales (even in the presence of distractors) and that communication channels between hierarchy levels organize information based on both sides of the communication process, both of which improve sample efficiency.
- Robust On-Policy Sampling for Data-Efficient Policy Evaluation in Reinforcement Learning · Conference on Neural Information Processing Systems · NeurIPS · 2022
  Reinforcement learning (RL) algorithms are often categorized as either on-policy or off-policy depending on whether they use data from a target policy of interest or from a different behavior policy. In this paper, we study a subtle distinction between on-policy data and on-policy sampling in the context of the RL sub-problem of policy evaluation. We observe that on-policy sampling may fail to match the expected distribution of on-policy data after observing only a finite number of trajectories and this failure hinders data-efficient policy evaluation. Towards improved data-efficiency, we show how non-i.i.d., off-policy sampling can produce data that more closely matches the expected on-policy data distribution and consequently increases the accuracy of the Monte Carlo estimator for policy evaluation. We introduce a method called Robust On-Policy Sampling and demonstrate theoretically and empirically that it produces data that converges faster to the expected on-policy distribution compared to on-policy sampling. Empirically, we show that this faster convergence leads to lower mean squared error policy value estimates.
- Deep Reinforcement Learning for Multi-Agent Interaction · AI Communications · Special Issue on Multi-Agent Systems Research in the UK · 2022
  The development of autonomous agents which can interact with other agents to accomplish a given task is a core area of research in artificial intelligence and machine learning. Towards this goal, the Autonomous Agents Research Group develops novel machine learning algorithms for autonomous systems control, with a specific focus on deep reinforcement learning and multi-agent reinforcement learning. Research problems include scalable learning of coordinated agent policies and inter-agent communication; reasoning about the behaviours, goals, and composition of other agents from limited observations; and sample-efficient learning based on intrinsic motivation, curriculum learning, causal inference, and representation learning. This article provides a broad overview of the ongoing research portfolio of the group and discusses open problems for future directions.
- Task Generalisation in Multi-Agent Reinforcement Learning · Doctoral Consortium · International Conference on Autonomous Agents and Multiagent Systems · AAMAS · 2022
  Multi-agent reinforcement learning agents are typically trained in a single environment. As a consequence, they overfit to the training environment, which results in sensitivity to perturbations and an inability to generalise to similar environments. For multi-agent reinforcement learning approaches to be applicable in real-world scenarios, generalisation and robustness need to be addressed. However, unlike in supervised learning, generalisation lacks a clear definition in multi-agent reinforcement learning. We discuss the problem of task generalisation and demonstrate the difficulty of zero-shot generalisation and finetuning using the example of multi-robot warehouse coordination, with preliminary results. Lastly, we discuss promising directions of research working towards generalisation of multi-agent reinforcement learning.
- Shared Experience Actor-Critic for Multi-Agent Reinforcement Learning · Conference on Neural Information Processing Systems · NeurIPS · 2020
  Exploration in multi-agent reinforcement learning is a challenging problem, especially in environments with sparse rewards. We propose a general method for efficient exploration by sharing experience amongst agents. Our proposed algorithm, called Shared Experience Actor-Critic (SEAC), applies experience sharing in an actor-critic framework. We evaluate SEAC in a collection of sparse-reward multi-agent environments and find that it consistently outperforms two baselines and two state-of-the-art algorithms by learning in fewer steps and converging to higher returns. In some harder environments, experience sharing makes the difference between learning to solve the task and not learning at all.
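The experience-sharing idea behind SEAC can be sketched as an importance-weighted policy loss: each agent trains on its own on-policy data plus other agents' trajectories, reweighted by the ratio between its own policy and the policy that generated the data. This is a simplified illustration (array names and the weighting `lam` are assumptions, not the paper's exact objective):

```python
import numpy as np

def shared_experience_policy_loss(log_probs_own, advantages_own,
                                  log_probs_other_under_own,
                                  log_probs_other_under_other,
                                  advantages_other, lam=1.0):
    """Sketch of a shared-experience actor loss for one agent.

    The first term is the standard policy-gradient loss on the agent's
    own experience. The second term reuses another agent's trajectory,
    corrected by the importance ratio pi_own / pi_other so the data
    behaves as if sampled from the agent's own policy.
    """
    own_term = -(log_probs_own * advantages_own).mean()
    ratio = np.exp(log_probs_other_under_own - log_probs_other_under_other)
    shared_term = -(ratio * log_probs_other_under_own * advantages_other).mean()
    return own_term + lam * shared_term
```

With identical policies the ratio is 1 and the shared term reduces to an ordinary policy-gradient loss on the borrowed experience; `lam` controls how strongly the borrowed experience influences the update.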
Workshops
- Learning Task Embeddings for Teamwork Adaptation in Multi-Agent Reinforcement Learning · Workshop on Generalization in Planning · NeurIPS · 2023
  Successful deployment of multi-agent reinforcement learning often requires agents to adapt their behaviour. In this work, we discuss the problem of teamwork adaptation in which a team of agents needs to adapt their policies to solve novel tasks with limited fine-tuning. Motivated by the intuition that agents need to be able to identify and distinguish tasks in order to adapt their behaviour to the current task, we propose to learn multi-agent task embeddings (MATE). These task embeddings are trained using an encoder-decoder architecture optimised for reconstruction of the transition and reward functions which uniquely identify tasks. We show that a team of agents is able to adapt to novel tasks when provided with task embeddings. We propose three MATE training paradigms: independent MATE, centralised MATE, and mixed MATE, which vary in the information used for the task encoding. We show that the embeddings learned by MATE identify tasks and provide useful information which agents leverage during adaptation to novel tasks.
Experience
Education
Teaching
Community
- 2024 – 2025 Co-lead organiser of the UK Multi-Agent Systems Symposium 2025 (~200 participants) in collaboration with the Alan Turing Institute and King’s College London.
- 2020 – 2022 Lead organiser and host of RL reading group at University of Edinburgh with speakers from leading industry and academic research groups.
Reviewing
- 2026 ICLR, ICML
- 2025 ICML (best reviewer award), RLDM
- 2024 ICML (best reviewer award), RLC, AAMAS, TMLR
- 2023 NeurIPS, NeurIPS Datasets & Benchmarks Track, ICML, AAMAS
- 2022 NeurIPS, NeurIPS Datasets & Benchmarks Track, ICML (best reviewer award), AAMAS
- 2021 NeurIPS, ICML