Unsupervised pre-training for reinforcement learning via recursive history encoders

Reinforcement Learning (RL) focuses on sequential decision-making problems, where an agent interacts with an environment to learn how to reach a goal. The Markov Decision Process (MDP) framework is used to model these problems. A solution to an MDP is a behaviour that optimizes the performance of an agent. Performance is measured using a reward signal, which is feedback from the environment to the agent. This is based on the idea of reinforcement (rewards and punishments), which is used in training animals. The learning process is driven by a learning algorithm that starts from an agent with an initial behaviour and updates it according to experience to maximize its performance. The traditional formulation of an MDP comprises an environment that is fixed, stationary, and outside the control of the agent. In this thesis, we consider an extension of standard MDPs, Multiple Environments MDPs, in which the agent finds itself in one of a set of environments and has to maximize its performance across all of them.

We consider the problem of unsupervised pre-training in RL using non-Markovian policies. The typical learning process starts from an initial behaviour, often random, and learns by interacting with the environment and collecting rewards. This process can be inefficient when the reward is sparse and difficult to collect. The goal of unsupervised pre-training is to learn skills in the environment that define a starting behaviour without collecting rewards from the environment. An initial behaviour defined in this way can make the learning of a subsequent task faster and more efficient.

This thesis proposes a new neural architecture for representing non-Markovian behaviours, which depend on the history and not just on the current state of the environment. The policy learned by our algorithm, History Mepol (HMepol), retains a limited-size representation of the history, in which it encodes only the information needed for future decisions. Having introduced this architecture, we provide empirical results on its performance, compare them with those of other state-of-the-art algorithms, and highlight the advantages of HMepol.

Reinforcement Learning centres on sequential decision-making problems, in which an agent interacts with an environment to reach a goal. The Markov Decision Process (MDP) formalism is used to model these problems. A solution to an MDP is an agent behaviour that optimizes the agent's performance. An agent's performance is measured through a reward feedback signal provided by the environment. This form of learning is rooted in the idea of reinforcement (reward or punishment) used in animal training. The process starts from an agent with an initial behaviour, which is updated and improved in order to maximize its performance. The traditional formulation assumes a single environment, fixed and stationary, outside the direct control of the agent. This thesis considers the multi-environment extension of MDPs: the agent is placed in one of several environments, and its task is to maximize its performance in each of them.

In this thesis we work in the subfield of unsupervised learning for reinforcement learning, using non-Markovian policies. The typical learning process starts from an initial behaviour, often random, which is improved by interacting with the environment and collecting reinforcement signals. This process can be very inefficient when the signal is sparse or hard to find. The goal of unsupervised learning is to learn a range of skills useful for exploring an environment, thereby defining an initial behaviour without reinforcement signals. Such a behaviour makes subsequent learning faster and more efficient.

This thesis proposes a new neural architecture for representing non-Markovian behaviours, whose choices depend on the history and not only on the current conditions of the environment. The policy learned by our algorithm, HMepol, maintains a limited representation of the history, in which it encodes only the information useful for future choices. Having introduced our architecture, we provide experimental results on its performance and then compare them with other state-of-the-art algorithms.