
Hi, I am Lukas

Lukas Schäfer

PhD Student at University of Edinburgh

I am a Data Science and Artificial Intelligence PhD student from Germany working on multi-agent reinforcement learning at the University of Edinburgh, where I am supervised by Stefano Albrecht and Amos Storkey.

Stefano, Filippos and I wrote a textbook on multi-agent reinforcement learning. The book will be published with MIT Press and is already available at!

Reinforcement learning is often sample inefficient, and prone to overfitting, leading to behaviour which often does not generalise across tasks. These challenges are further exacerbated in multi-agent systems in which multiple agents interact with each other in a shared environment. Guided by these challenges of efficiency and generalisation, my PhD research focuses on efficient and generalisable reinforcement learning in multi-agent systems.

Previously, I interned at Microsoft Research Cambridge, where I worked with the Game Intelligence lab under the supervision of Sam Devlin, as well as at Huawei Noah’s Ark Lab and Dematic. At MSR, I researched visual encoders for imitation learning in video games, including vision foundation models. At Huawei, I was part of the multi-agent team researching novel exploration techniques for multi-agent reinforcement learning under the supervision of David Mguni. At Dematic, I researched and developed AI solutions for large-scale warehouse automation.

Contact: l.schaefer [at]


Dec 06, 2023

Nov 01, 2023

📢 It is DONE! Our textbook “Multi-Agent Reinforcement Learning: Foundations and Modern Approaches” is now with MIT Press and available on the official webpage! The print release is scheduled for late 2024.

Oct 30, 2023

📃 Our work Learning Task Embeddings for Teamwork Adaptation in Multi-Agent Reinforcement Learning will be presented at the Workshop on Generalization in Planning at the Conference on Neural Information Processing Systems (NeurIPS) 2023!

May 29, 2023

📢 I’m very excited to announce that the first pre-print non-final PDF of our book Multi-Agent Reinforcement Learning: Foundations and Modern Approaches is now released and available on the official webpage!

Apr 14, 2023

📃 Our works, Ensemble Value Functions for Efficient Exploration in Multi-Agent Reinforcement Learning and Using Offline Data to Speed-up Reinforcement Learning in Procedurally Generated Environments, will be presented at the Adaptive and Learning Agents (ALA) Workshop at the International Conference on Autonomous Agents and Multi-Agent Systems (AAMAS) 2023!

Apr 03, 2023

📢 I’m very excited to start a research internship with the Game Intelligence lab at Microsoft Research Cambridge!

Sep 14, 2022

📃 Our work, Robust On-Policy Sampling for Data-Efficient Policy Evaluation in Reinforcement Learning, has been accepted at the Neural Information Processing Systems Conference (NeurIPS) 2022!

Aug 27, 2022

📢 Excited to announce that I was selected to attend the upcoming 9th Heidelberg Laureate Forum!


Multi-Agent Reinforcement Learning: Foundations and Modern Approaches

This book provides a comprehensive introduction to multi-agent reinforcement learning (MARL), a rapidly growing field that combines insights from game theory, reinforcement learning, and multi-agent systems. The book covers the foundations of MARL, including the key concepts, algorithms, and challenges, and presents a detailed overview of contemporary approaches of deep MARL research.

Video games have served as useful benchmarks for the decision making community, but going beyond Atari games towards training agents in modern games has been prohibitively expensive for the vast majority of the research community. Recent progress in the research, development and open release of large vision models has the potential to amortize some of these costs across the community. However, it is currently unclear which of these models have learnt representations that retain information critical for sequential decision making. Towards enabling wider participation in the research of gameplaying agents in modern games, we present a systematic study of imitation learning with publicly available visual encoders compared to the typical, task-specific, end-to-end training approach in Minecraft, Minecraft Dungeons and Counter-Strike: Global Offensive.

Successful deployment of multi-agent reinforcement learning often requires agents to adapt their behaviour. In this work, we discuss the problem of teamwork adaptation in which a team of agents needs to adapt their policies to solve novel tasks with limited fine-tuning. Motivated by the intuition that agents need to be able to identify and distinguish tasks in order to adapt their behaviour to the current task, we propose to learn multi-agent task embeddings (MATE). These task embeddings are trained using an encoder-decoder architecture optimised for reconstruction of the transition and reward functions which uniquely identify tasks. We show that a team of agents is able to adapt to novel tasks when provided with task embeddings. We propose three MATE training paradigms: independent MATE, centralised MATE, and mixed MATE which vary in the information used for the task encoding. We show that the embeddings learned by MATE identify tasks and provide useful information which agents leverage during adaptation to novel tasks.

Cooperative multi-agent reinforcement learning (MARL) requires agents to explore to learn to cooperate. Existing value-based MARL algorithms commonly rely on random exploration, such as ϵ-greedy, which is inefficient in discovering multi-agent cooperation. Additionally, the environment in MARL appears non-stationary to any individual agent due to the simultaneous training of other agents, leading to highly variant and thus unstable optimisation signals. In this work, we propose ensemble value functions for multi-agent exploration (EMAX), a general framework to extend any value-based MARL algorithm. EMAX trains ensembles of value functions for each agent to address the key challenges of exploration and non-stationarity: (1) The uncertainty of value estimates across the ensemble is used in a UCB policy to guide the exploration of agents to parts of the environment which require cooperation. (2) Average value estimates across the ensemble serve as target values. These targets exhibit lower variance compared to commonly applied target networks and we show that they lead to more stable gradients during the optimisation. We instantiate three value-based MARL algorithms with EMAX, independent DQN, VDN and QMIX, and evaluate them in 21 tasks across four environments. Using ensembles of five value functions, EMAX improves sample efficiency and final evaluation returns of these algorithms by 53%, 36%, and 498%, respectively, averaged across all 21 tasks.
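The two ensemble ideas in this abstract can be illustrated with a small tabular sketch. This is not the EMAX implementation: the function names, the `beta` exploration weight, the use of the population standard deviation as the uncertainty measure, and the one-step bootstrapped form of the target are all assumptions made for illustration.

```python
import statistics


def ucb_action(ensemble_q_values, beta=1.0):
    """Pick an action by mean + beta * std across an ensemble (assumed UCB form).

    ensemble_q_values: list of K lists, each holding one Q estimate per action.
    """
    num_actions = len(ensemble_q_values[0])
    scores = []
    for a in range(num_actions):
        estimates = [q[a] for q in ensemble_q_values]
        mean = statistics.fmean(estimates)
        std = statistics.pstdev(estimates)  # disagreement as uncertainty proxy
        scores.append(mean + beta * std)
    return max(range(num_actions), key=scores.__getitem__)


def ensemble_target(reward, next_ensemble_q_values, gamma=0.99, done=False):
    """One-step target built from ensemble-average values (assumed form)."""
    num_actions = len(next_ensemble_q_values[0])
    mean_q = [
        statistics.fmean(q[a] for q in next_ensemble_q_values)
        for a in range(num_actions)
    ]
    return reward + (0.0 if done else gamma * max(mean_q))


# Hypothetical ensemble of three value functions over two actions.
ensemble = [[1.0, 0.0], [1.0, 2.0], [1.0, -2.0]]
greedy = ucb_action(ensemble, beta=0.0)      # ignores uncertainty
optimistic = ucb_action(ensemble, beta=1.0)  # rewards ensemble disagreement
```

With `beta=0.0` the rule reduces to acting greedily on the ensemble mean, while a positive `beta` favours actions on which the ensemble members disagree.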

One of the key challenges of Reinforcement Learning (RL) is the ability of agents to generalise their learned policy to unseen settings. Moreover, training RL agents requires large numbers of interactions with the environment. Motivated by the recent success of Offline RL and Imitation Learning (IL), we conduct a study to investigate whether agents can leverage offline data in the form of trajectories to improve the sample-efficiency in procedurally generated environments. We consider two settings of using IL from offline data for RL: (1) pre-training a policy before online RL training and (2) concurrently training a policy with online RL and IL from offline data. We analyse the impact of the quality (optimality of trajectories) and diversity (number of trajectories and covered level) of available offline trajectories on the effectiveness of both approaches. Across four well-known sparse reward tasks in the MiniGrid environment, we find that using IL for pre-training and concurrently during online RL training both consistently improve the sample-efficiency while converging to optimal policies. Furthermore, we show that pre-training a policy from as few as two trajectories can make the difference between learning an optimal policy at the end of online training and not learning at all. Our findings motivate the widespread adoption of IL for pre-training and concurrent IL in procedurally generated environments whenever offline trajectories are available or can be generated.

Multi-Horizon Representations with Hierarchical Forward Models for Reinforcement Learning

Learning control from pixels is difficult for reinforcement learning (RL) agents because representation learning and policy learning are intertwined. Previous approaches remedy this issue with auxiliary representation learning tasks, but they either do not consider the temporal aspect of the problem or only consider single-step transitions. Instead, we propose Hierarchical k-Step Latent (HKSL), an auxiliary task that learns representations via a hierarchy of forward models that operate at varying magnitudes of step skipping while also learning to communicate between levels in the hierarchy. We evaluate HKSL in a suite of 30 robotic control tasks and find that HKSL either reaches higher episodic returns or converges to maximum performance more quickly than several current baselines. Also, we find that levels in HKSL’s hierarchy can learn to specialize in long- or short-term consequences of agent actions, thereby providing the downstream control policy with more informative representations. Finally, we determine that communication channels between hierarchy levels organize information based on both sides of the communication process, which improves sample efficiency.

Robust On-Policy Sampling for Data-Efficient Policy Evaluation in Reinforcement Learning

Reinforcement learning (RL) algorithms are often categorized as either on-policy or off-policy depending on whether they use data from a target policy of interest or from a different behavior policy. In this paper, we study a subtle distinction between on-policy data and on-policy sampling in the context of the RL sub-problem of policy evaluation. We observe that on-policy sampling may fail to match the expected distribution of on-policy data after observing only a finite number of trajectories and this failure hinders data-efficient policy evaluation. Towards improved data-efficiency, we show how non-i.i.d., off-policy sampling can produce data that more closely matches the expected on-policy data distribution and consequently increases the accuracy of the Monte Carlo estimator for policy evaluation. We introduce a method called Robust On-Policy Sampling and demonstrate theoretically and empirically that it produces data that converges faster to the expected on-policy distribution compared to on-policy sampling. Empirically, we show that this faster convergence leads to lower mean squared error policy value estimates.
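The sampling-error effect described in this abstract can be seen in a one-step (bandit) setting. The sketch below is not the Robust On-Policy Sampling method itself: it swaps in a much simpler greedy rule that always picks the action most under-sampled relative to the target policy, and all names, the policy `pi`, and the rewards are hypothetical.

```python
import random


def on_policy_sample(pi, n, rng):
    """Draw n i.i.d. actions from the target policy pi."""
    return rng.choices(range(len(pi)), weights=pi, k=n)


def corrective_sample(pi, n):
    """Non-i.i.d. sampling: always take the action whose empirical count
    lags furthest behind its expected count under pi (illustrative rule)."""
    counts = [0] * len(pi)
    actions = []
    for t in range(n):
        deficits = [(t + 1) * p - c for p, c in zip(pi, counts)]
        a = max(range(len(pi)), key=deficits.__getitem__)
        counts[a] += 1
        actions.append(a)
    return actions


def monte_carlo_value(actions, rewards):
    """Plain Monte Carlo estimate: average observed reward."""
    return sum(rewards[a] for a in actions) / len(actions)


pi = [0.7, 0.3]        # hypothetical target policy over two actions
rewards = [1.0, 0.0]   # deterministic reward per action; true value is 0.7
rng = random.Random(0)

iid_estimate = monte_carlo_value(on_policy_sample(pi, 10, rng), rewards)
corrective_estimate = monte_carlo_value(corrective_sample(pi, 10), rewards)
```

In this toy setting the corrective sampler's empirical action frequencies track `pi` closely after every step, so its Monte Carlo estimate stays near the true value, whereas the i.i.d. estimate varies with the random draw.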

Multi-agent reinforcement learning agents are typically trained in a single environment. As a consequence, they overfit to the training environment which results in sensitivity to perturbations and inability to generalise to similar environments. For multi-agent reinforcement learning approaches to be applicable in real-world scenarios, generalisation and robustness need to be addressed. However, unlike in supervised learning, generalisation lacks a clear definition in multi-agent reinforcement learning. We discuss the problem of task generalisation and demonstrate the difficulty of zero-shot generalisation and finetuning using the example of multi-robot warehouse coordination with preliminary results. Lastly, we discuss promising directions of research working towards generalisation of multi-agent reinforcement learning.

Intrinsic rewards can improve exploration in reinforcement learning, but the exploration process may suffer from instability caused by non-stationary reward shaping and strong dependency on hyperparameters. In this work, we introduce Decoupled RL (DeRL) as a general framework which trains separate policies for intrinsically-motivated exploration and exploitation. Such decoupling allows DeRL to leverage the benefits of intrinsic rewards for exploration while demonstrating improved robustness and sample efficiency. We evaluate DeRL algorithms in two sparse-reward environments with multiple types of intrinsic rewards. Our results show that DeRL is more robust to varying scale and rate of decay of intrinsic rewards and converges to the same evaluation returns as intrinsically-motivated baselines in fewer interactions. Lastly, we discuss the challenge of distribution shift and show that divergence constraint regularisers can successfully minimise instability caused by divergence of exploration and exploitation policies.

The development of autonomous agents which can interact with other agents to accomplish a given task is a core area of research in artificial intelligence and machine learning. Towards this goal, the Autonomous Agents Research Group develops novel machine learning algorithms for autonomous systems control, with a specific focus on deep reinforcement learning and multi-agent reinforcement learning. Research problems include scalable learning of coordinated agent policies and inter-agent communication; reasoning about the behaviours, goals, and composition of other agents from limited observations; and sample-efficient learning based on intrinsic motivation, curriculum learning, causal inference, and representation learning. This article provides a broad overview of the ongoing research portfolio of the group and discusses open problems for future directions.

We envision a warehouse in which dozens of mobile robots and human pickers work together to collect and deliver items within the warehouse. The fundamental problem we tackle, called the order-picking problem, is how these worker agents must coordinate their movement and actions in the warehouse to maximise performance (e.g. order throughput). Established industry methods using heuristic approaches require large engineering efforts to optimise for innately variable warehouse configurations. In contrast, multi-agent reinforcement learning (MARL) can be flexibly applied to diverse warehouse configurations (e.g. size, layout, number/types of workers, item replenishment frequency), as the agents learn through experience how to optimally cooperate with one another. We develop hierarchical MARL algorithms in which a manager assigns goals to worker agents, and the policies of the manager and workers are co-trained toward maximising a global objective (e.g. pick rate). Our hierarchical algorithms achieve significant gains in sample efficiency and overall pick rates over baseline MARL algorithms in diverse warehouse configurations, and substantially outperform two established industry heuristics for order-picking systems.

Benchmarking Multi-Agent Deep Reinforcement Learning Algorithms in Cooperative Tasks

Multi-agent deep reinforcement learning (MARL) suffers from a lack of commonly-used evaluation tasks and criteria, making comparisons between approaches difficult. In this work, we consistently evaluate and compare three different classes of MARL algorithms (independent learning, centralised multi-agent policy gradient, value decomposition) in a diverse range of cooperative multi-agent learning tasks. Our experiments serve as a reference for the expected performance of algorithms across different learning tasks, and we provide insights regarding the effectiveness of different learning approaches. We open-source EPyMARL, which extends the PyMARL codebase to include additional algorithms and allow for flexible configuration of algorithm implementation details such as parameter sharing. Finally, we open-source two environments for multi-agent research which focus on coordination under sparse rewards.

Intrinsic rewards are commonly applied to improve exploration in reinforcement learning. However, these approaches suffer from instability caused by non-stationary reward shaping and strong dependency on hyperparameters. In this work, we propose Decoupled RL (DeRL) which trains separate policies for exploration and exploitation. DeRL can be applied with on-policy and off-policy RL algorithms. We evaluate DeRL algorithms in two sparse-reward environments with multiple types of intrinsic rewards. We show that DeRL is more robust to scaling and speed of decay of intrinsic rewards and converges to the same evaluation returns as intrinsically motivated baselines in fewer interactions.

This paper considers how to complement offline reinforcement learning (RL) data with additional data collection for the task of policy evaluation. In policy evaluation, the task is to estimate the expected return of an evaluation policy on an environment of interest. Prior work on offline policy evaluation typically only considers a static dataset. We consider a setting where we can collect a small amount of additional data to combine with a potentially larger offline RL dataset. We show that simply running the evaluation policy – on-policy data collection – is sub-optimal for this setting. We then introduce two new data collection strategies for policy evaluation, both of which consider previously collected data when collecting future data so as to reduce distribution shift (or sampling error) in the entire dataset collected. Our empirical results show that compared to on-policy sampling, our strategies produce data with lower sampling error and generally lead to lower mean-squared error in policy evaluation for any total dataset size. We also show that these strategies can start from initial off-policy data, collect additional data, and then use both the initial and new data to produce low mean-squared error policy evaluation without using off-policy corrections.

Multi-agent deep reinforcement learning (MARL) suffers from a lack of commonly-used evaluation tasks and criteria, making comparisons between approaches difficult. In this work, we evaluate and compare three different classes of MARL algorithms (independent learning, centralised multi-agent policy gradient, and value decomposition) in a diverse range of fully-cooperative multi-agent learning tasks. Our experiments can serve as a reference for the expected performance of algorithms across different learning tasks. We also provide further insight about (1) when independent learning might be surprisingly effective despite non-stationarity, (2) when centralised training should (and shouldn’t) be applied and (3) which benefits value decomposition can bring.

Learning Temporally-Consistent Representations for Data-Efficient Reinforcement Learning
arXiv 2021

Deep reinforcement learning (RL) agents that exist in high-dimensional state spaces, such as those composed of images, have interconnected learning burdens. Agents must learn an action-selection policy that completes their given task, which requires them to learn a representation of the state space that discerns between useful and useless information. The reward function is the only supervised feedback that RL agents receive, which causes a representation learning bottleneck that can manifest in poor sample efficiency. We present k-Step Latent (KSL), a new representation learning method that enforces temporal consistency of representations via a self-supervised auxiliary task wherein agents learn to recurrently predict action-conditioned representations of the state space. The state encoder learned by KSL produces low-dimensional representations that make optimization of the RL task more sample efficient. Altogether, KSL produces state-of-the-art results in both data efficiency and asymptotic performance in the popular PlaNet benchmark suite. Our analyses show that KSL produces encoders that generalize better to new tasks unseen during training, and its representations are more strongly tied to reward, are more invariant to perturbations in the state space, and move more smoothly through the temporal axis of the RL problem than other methods such as DrQ, RAD, CURL, and SAC-AE.

Shared Experience Actor-Critic for Multi-Agent Reinforcement Learning

Exploration in multi-agent reinforcement learning is a challenging problem, especially in environments with sparse rewards. We propose a general method for efficient exploration by sharing experience amongst agents. Our proposed algorithm, called Shared Experience Actor-Critic (SEAC), applies experience sharing in an actor-critic framework. We evaluate SEAC in a collection of sparse-reward multi-agent environments and find that it consistently outperforms two baselines and two state-of-the-art algorithms by learning in fewer steps and converging to higher returns. In some harder environments, experience sharing makes the difference between learning to solve the task and not learning at all.


Microsoft Research

Apr 2023 - Oct 2023


Research Scientist Intern

Apr 2023 - Oct 2023

  • Researching visual encoders for imitation learning in video games under the supervision of Sam Devlin.

Heidelberg Laureate Forum

Sep 2022 - Sep 2022


The Heidelberg Laureate Forum brings together the most exceptional mathematicians and computer scientists of their generations. Each year, the recipients of the most prestigious awards in mathematics and computer science, the Abel Prize, ACM A.M. Turing Award, ACM Prize in Computing, Fields Medal, IMU Abacus Medal and Nevanlinna Prize, meet 200 selected young researchers from all over the world. Participants spend a week interacting and networking in a relaxed atmosphere designed to encourage scientific exchange.

Young Research Attendee

Sep 2022 - Sep 2022


Huawei Noah's Ark Lab

Jul 2022 - Dec 2022


The Noah’s Ark Lab is the AI research center for Huawei Technologies, working towards significant contributions to both the company and society by innovating in artificial intelligence, data mining and related fields.

Research Scientist Intern

Jul 2022 - Dec 2022

  • Researched ensemble models for exploration in multi-agent reinforcement learning with the RL and multi-agent team under the supervision of David Mguni.
  • Submitted a publication as the result of the internship to a top-tier machine learning conference (under review). A preprint is available on arXiv (


Dematic

Nov 2020 - Mar 2021


Dematic is a global player focused on the design and implementation of automated system solutions for warehouses, distribution centres and production facilities.

Research Intern

Nov 2020 - Mar 2021

  • Applying state-of-the-art AI technology to enable a prototype for automation of large-scale robotic warehouse logistics.


HYPED

Sep 2018 - Aug 2020


HYPED is a team of students at the University of Edinburgh dedicated to developing the Hyperloop concept and inspiring future generations about engineering. HYPED has received awards from SpaceX, Virgin Hyperloop One and the Institution of Civil Engineers.

Navigation Advisor

Sep 2019 - Aug 2020

  • Advising navigation team on the adaptation and implementation of improved sensor and filtering techniques

Navigation Engineer

Sep 2018 - Aug 2019

  • Developing navigation system of “The Flying Podsman” Hyperloop prototype using sensor filtering, processing and control techniques to estimate location, orientation and speed of the pod
  • Finalist for the SpaceX 2019 Hyperloop competition in California in Summer 2019


Ph.D. in Data Science and Artificial Intelligence
Efficient Exploration in Single-Agent and Multi-Agent Deep Reinforcement Learning
Stefano V. Albrecht (primary) and Amos Storkey (secondary)
Principal’s Career Development Scholarship from the University of Edinburgh
Reinforcement Learning, Multi-Agent Systems, Generalisation, Exploration, Intrinsic Rewards
M.Sc. in Informatics
CGPA: 77.28% out of
Taken Courses:
Course Name | Obtained Credit
Reinforcement Learning | 82%
Algorithmic Game Theory and its Applications | 98%
Machine Learning and Pattern Recognition | 64%
Probabilistic Modelling and Reasoning | 75%
Decision Making in Robots and Autonomous Agents | 86%
Robotics: Science and Systems | 87%
Natural Computing | 84%
Informatics Project Proposal | 73%
Informatics Research Review | 72%
Extracurricular Activities:
  • Active position as navigation engineer for HYPED.
  • Participation in GEAS roleplaying society.
  • Participation in EUKC - Edinburgh University Kendo Club.
DAAD (German Academic Exchange Service) graduate scholarship & Stevenson Exchange Scholarship
B.Sc. in Informatics
German scale: 1.2 out of
Taken Courses:
Course Name | Obtained Credit
Automated Planning | 1.0
Admissible Search Enhancements | 1.0
Information Retrieval and Data Mining | 1.7
Neural Networks: Implementation and Application | 2.0
Artificial Intelligence | 1.7
Software Engineering | 1.3
Modern Imperative Programming Languages | 1.3
Concurrent Programming | 2.7
Fundamentals of Data Structures and Algorithms | 1.7
Information Systems | 1.3
Introduction to Theoretical Computer Science | 1.0
System Architecture | 1.0
Mathematics for Computer Scientists I | 1.0
Mathematics for Computer Scientists II | 2.3
Mathematics for Computer Scientists III | 1.7
Programming I | 1.0
Programming II | 1.0
Japanese Foundations - Shokyu I | 1.3
Japanese Foundations - Shokyu II | 2.0
Japanese Applied Geography | 1.0
Japanese History II | 1.0
Extracurricular Activities:
  • Japanese language and cultural studies as minor subject.

Teaching Experience

Oct 2019 - June 2022, School of Informatics, University of Edinburgh

Teaching assistant, demonstrator and marker for three iterations of the Reinforcement Learning lecture at the University of Edinburgh under Dr. Stefano V. Albrecht

  • Holding lectures on implementation of RL systems and Deep RL
  • Designing RL project covering wide range of topics including dynamic programming, single- and multi-agent RL as well as deep RL
  • Marking project and exam for reinforcement learning course
  • Advising students on various challenges regarding lecture material and content

Jun 2022 - Present, School of Informatics, University of Edinburgh

Co-supervised visiting PhD student project at the University of Edinburgh

  • Supervision through regular meetings discussing research project and ideating novel solutions
  • Assisted project towards a successful workshop publication at the ALA workshop at AAMAS 2023

Feb 2021 - Aug 2021, School of Informatics, University of Edinburgh

Co-supervised final Masters students’ projects at the University of Edinburgh

  • Co-supervised two M.Sc. students through project proposal, refinement and execution towards final thesis
  • Assisted M.Sc. student from their thesis towards a successful workshop publication at NeurIPS 2021, and a successful main conference publication at NeurIPS 2022.

Sep 2017 - Oct 2017, Mathematics Preparation Course, Saarland University

Voluntary lecturer and coach for the mathematics preparation course preparing upcoming computer science undergraduate students for their studies

  • Assisted the organisation of the mathematics preparation course for upcoming computer science students aiming to introduce them to foundational mathematical concepts, the university and student life as a whole
  • Introduced ∼250 participants to the importance of mathematics for computer science, formal languages and predicate logic in daily lectures of the first week
  • Supervised two groups to provide feedback and further assistance in daily coaching-sessions
  • The course received the BESTE-award for special student commitment 2017 at Saarland University

Oct 2016 - Mar 2017, Dependable Systems and Software Chair, Saarland University

Tutor for the Programming 1 lecture about functional programming at the Dependable Systems and Software Group chair of Saarland University under Prof. Dr. Holger Hermanns

  • Taught first-year students fundamental concepts of functional programming, basic complexity theory and inductive correctness proofs in weekly tutorials and office hours
  • Corrected weekly tests as well as mid- and endterm exams
  • Collectively created learning materials and discussed student progress as part of the whole teaching team


2023: NeurIPS, NeurIPS Datasets and Benchmarks Track, ICML, AAMAS
2022: NeurIPS, NeurIPS Datasets and Benchmarks Track, ICML (top 10% outstanding reviewer award), AAMAS
2021: NeurIPS
2020: Pre-registration experiment workshop at NeurIPS
2024: Transactions on Machine Learning Research (TMLR)