Opponent identification in multi-agent reinforcement learning

Imitation Learning is the problem of recovering information about other agents' strategies and goals. Research in this area has mainly focused on imitating demonstrations from expert agents. In Multi-Agent environments, however, agents usually learn simultaneously, so the observed behavior comes from policies that are not yet optimal; imitating such a policy does not provide the desired information and, in most cases, does not lead to high payoffs. Recent works have relaxed the expert assumption and developed imitation techniques for agents that are still learning, in Single-Agent environments. Building on these results, we develop a technique to estimate the reward function of the opponent agent in a Multi-Agent environment where the agents are still learning. We present the approach and the results obtained by applying it in different scenarios. Moreover, during the learning phase, agents update their strategies following specific algorithms. Within a Multi-Agent environment, we present a technique to identify the algorithm used by the other agent from a finite set of candidate algorithms. On top of this technique, we develop an exploration strategy that facilitates the identification by maximizing the Kullback-Leibler divergence between the estimated future strategies that the agent would follow under the different algorithms. We present the results of these identification techniques on a set of gradient-based algorithms, together with a possible application in which the agent plays a Best Response strategy against the identified opponent.

Imitation learning is the problem of recovering information about the strategies and goals of other agents. Research in this area has concentrated mainly on imitating the demonstrations of expert agents. Nevertheless, in Multi-Agent environments the agents normally learn simultaneously, so imitating sub-optimal strategies does not provide the desired information and, in most cases, does not lead to higher rewards. Recent works have overcome this assumption and developed techniques for imitating agents that are still in their learning phase, in Single-Agent environments. In this work, building on these results, we have developed a technique to estimate the reward function of the opponent agent in a Multi-Agent environment where the agents are still learning. We present the approach and the results obtained by applying the method in various scenarios. Furthermore, during the learning phase, the agents update their strategies following specific algorithms. In this text, considering a Multi-Agent environment, we present a technique to identify the algorithm used by the other agent within a finite set of possible algorithms. Starting from this technique, we have developed an exploration strategy that facilitates the identification of the algorithm by maximizing the Kullback-Leibler divergence between the estimated future strategies that the agent will have when following the various algorithms. We present the results of these identification techniques applied to a set of gradient-based algorithms, and the result of a possible application in which the agent plays the best-response strategy against the identified opponent.
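
The two identification ideas summarized above can be illustrated with a toy sketch. It is not the method developed in this work; it only shows the general shape of the idea under strong assumptions. The assumptions are: a two-action matrix game, an opponent whose policy is `sigmoid(theta)` with a known current parameter `theta`, and two hypothetical candidate update rules, `gradient_ascent` and `sign_gradient_ascent`, invented here for illustration. `identify` scores each candidate by the log-likelihood of the observed opponent actions, and `most_informative_strategy` picks the mixed strategy of ours for which the two candidates predict the most different next policies, measured by the Kullback-Leibler divergence.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def payoff_gradient(theta, q, R):
    """Derivative of the opponent's expected payoff w.r.t. theta.

    p = sigmoid(theta) is the probability that the opponent plays action 0,
    q is the probability that we play action 0, and R[i][j] is the opponent's
    payoff when it plays i and we play j (all names are illustrative).
    """
    p = sigmoid(theta)
    u0 = q * R[0][0] + (1 - q) * R[0][1]  # opponent's value of action 0
    u1 = q * R[1][0] + (1 - q) * R[1][1]  # opponent's value of action 1
    return (u0 - u1) * p * (1 - p)        # chain rule: dp/dtheta = p(1-p)

# Two hypothetical candidate learning rules for the opponent.
def gradient_ascent(theta, q, R, lr=1.0):
    return theta + lr * payoff_gradient(theta, q, R)

def sign_gradient_ascent(theta, q, R, lr=0.2):
    g = payoff_gradient(theta, q, R)
    return theta + lr * ((g > 0) - (g < 0))  # step of fixed size lr

def kl_bernoulli(p1, p2):
    """KL divergence between Bernoulli(p1) and Bernoulli(p2)."""
    return p1 * math.log(p1 / p2) + (1 - p1) * math.log((1 - p1) / (1 - p2))

def identify(theta0, R, candidates, history):
    """Index of the candidate that best explains the observed actions.

    history is a list of (q, opponent_action) pairs; theta0 is assumed known.
    """
    scores = []
    for update in candidates:
        theta, log_lik = theta0, 0.0
        for q, action in history:
            p = sigmoid(theta)
            log_lik += math.log(p if action == 0 else 1 - p)
            theta = update(theta, q, R)
        scores.append(log_lik)
    return max(range(len(candidates)), key=scores.__getitem__)

def most_informative_strategy(theta, R, candidates, grid):
    """Our mixed strategy q that maximizes the KL divergence between the
    next-step policies predicted by the two candidate rules."""
    def score(q):
        pa, pb = (sigmoid(update(theta, q, R)) for update in candidates)
        return kl_bernoulli(pa, pb)
    return max(grid, key=score)

if __name__ == "__main__":
    R = [[1.0, 0.0], [0.0, 1.0]]  # opponent is rewarded for matching us
    rules = [gradient_ascent, sign_gradient_ascent]
    grid = [i / 10 for i in range(11)]
    print(most_informative_strategy(0.0, R, rules, grid))
```

In this toy game the two rules agree when we play uniformly, since the opponent's gradient is then zero, so a uniform strategy reveals nothing about which rule is in use. Strategies slightly away from uniform separate the candidates most, because the sign-based rule takes a full step while plain gradient ascent barely moves.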