Challenges and Opportunities in Multi-Agent Reinforcement Learning

Reinforcement Learning (RL) is a Machine Learning area that studies sequential decision-making problems, where a learning agent interacts with an unknown environment in order to maximize its rewards. In recent years, RL methods have made substantial progress in solving real-world problems. However, the most successful applications, such as beating the world champion player of Go, solving robotic control problems, managing the power consumption of households, and achieving promising results in autonomous driving, involve more than one agent and can be cast in the Multi-Agent Reinforcement Learning (MARL) setting. However, although the MARL setting is an important research area of practical interest, this framework is still poorly understood from a theoretical point of view. In general, the presence of many agents makes the learning problem more complex, and in many situations, single RL algorithms cannot be applied. In this thesis, we take a step toward solving this problem, providing theoretically sound algorithms for this setting. We analyze the challenges and opportunities that a multi-agent environment creates in the RL framework, providing new approaches in three RL sub-problems while also showing how they are interconnected. The contributions of the thesis are theoretical, algorithmic, and experimental. We take inspiration from practical problems, we design new algorithms with desirable theoretical properties to solve them, and we show their performances in benchmarks domains and on real-world data. The thesis is divided into four parts. In the first part, we provide the background and preliminaries necessary to follow the rest of the thesis. We start by introducing the RL problem and classical algorithms to solve it. Then, we introduce the Inverse RL problem, i.e., the problem of recovering reward functions from an expert's demonstrations. Finally, we provide the necessary background on game theory and MARL. In the second part, we analyze how the presence of multiple agents affects the Inverse Reinforcement Learning problem and objective. We provide two novel algorithms: the first considers how to recover and cluster the intentions of (i.e., the rewards optimized by) a set of agents given demonstrations of near-optimal behavior; the second aims at inferring the reward function optimized by an agent while observing its actual learning process. The experimental evaluation is conducted on synthetic problems and two real-world problems. We then show the importance of learning (or knowing) the other agent's intention to construct efficient algorithms. In particular, in the third part, we study online learning in the MARL scenario, showing how the presence of other agents can increase the hardness of the problem while proposing statistically efficient algorithms. We design two algorithms to solve the Configurable MDP problem, a setting where an external entity can partially control the transition model. Then, we analyze the statistical limits of general-sum stochastic games when we control only one agent, providing a new lower bound and a near-optimal algorithm. Finally, in the fourth part, we study MARL from an optimization viewpoint while showing the difficulties that arise from multiple function optimization problems. Then, we present a new algorithm for this scenario, providing convergence results and extensively evaluating it against SotA baselines.

L’apprendimento per rinforzo (chiamato Reinforcement Learning, RL) è un'area del Machine Learning che studia i problemi decisionali sequenziale, in cui un agente RL interagisce con un ambiente sconosciuto al fine di massimizzare le sue ricompense. Negli ultimi anni, i metodi RL hanno compiuto progressi sostanziali nella risoluzione dei problemi del mondo reale. Tuttavia, le applicazioni di maggior successo, come vincere contro il campione del mondo di Go, risolvere problemi di controllo robotico, gestire automaticamente l’energia di una casa e promettenti risultati nella guida autonoma, coinvolgono più di un agente e possono essere inseriti nel contesto dell’apprendimento per rinforzo multi-agente (chiamato Multi-Agent Reinforcement Learning, MARL). Tuttavia, sebbene il setting MARL sia un'importante area di ricerca di interesse pratico, questo settore è ancora poco compreso da un punto di vista teorico. In generale, la presenza di molti agenti rende il problema di apprendimento più complesso e in molte situazioni non è possibile applicare algoritmi creati per problemi RL singolo agente. In questa tesi, facciamo un passo verso la risoluzione di questo problema, fornendo algoritmi teoricamente validi. Analizziamo le sfide e le opportunità che un ambiente multi-agente crea nel framework RL, fornendo nuovi approcci in tre sotto-problemi RL e mostrando anche come essi sono interconnessi. I contributi della tesi sono teorici, algoritmici e sperimentali. Prendiamo ispirazione da problemi pratici, progettiamo nuovi algoritmi con proprietà teoriche desiderabili per risolverli e mostriamo le loro prestazioni in domini simulati e su dati del mondo reale. La tesi è suddivisa in quattro parti. Nella prima parte forniamo i preliminari necessari per seguire il resto della tesi. Iniziamo introducendo il problema RL e gli algoritmi classici per risolverlo. Quindi, introduciamo il problema RL inverso, ovvero il problema del recupero delle funzioni di ricompensa dalle dimostrazioni di un esperto. Infine, forniamo il background necessario sulla teoria dei giochi e MARL. Nella seconda parte, analizziamo come la presenza di più agenti influenzi il problema e l'obiettivo di apprendimento per rinforzo inverso. Forniamo due nuovi algoritmi: il primo considera come recuperare e raggruppare le intenzioni di (cioè le ricompense ottimizzate da) un insieme di agenti che hanno dato dimostrazioni di comportamento ottimale; il secondo mira a dedurre la funzione di ricompensa ottimizzata da un agente osservando il suo effettivo processo di apprendimento. La valutazione sperimentale è condotta su problemi sintetici e su due problemi del mondo reale. Mostriamo quindi l'importanza di apprendere (o conoscere) l'intenzione dell'altro agente per costruire algoritmi efficienti. In particolare, nella terza parte, si studia l'apprendimento online nello scenario MARL, mostrando come la presenza di altri agenti possa aumentare la complessità del problema. Progettiamo due algoritmi per risolvere il problema MDP configurabile, un'impostazione in cui un'entità esterna può controllare parzialmente il modello di transizione. Successivamente, analizziamo i limiti statistici dei giochi stocastici a somma generale quando controlliamo un solo agente, fornendo un nuovo limite inferiore e un algoritmo quasi ottimale. Infine, nella quarta parte, studiamo il problema MARL dal punto di vista dell'ottimizzazione, mostrando le difficoltà che derivano da problemi di ottimizzazione di molteplici funzioni. Quindi, presentiamo un nuovo algoritmo per questo scenario, fornendo risultati di convergenza e valutandolo ampiamente rispetto alle linee di base SotA.