Delays in reinforcement learning

Delays are inherent to most dynamical systems. Besides shifting the process in time, they can drastically affect its performance. For this reason, it is usually valuable to study the delay and account for it. Sequential decision-making problems such as Markov decision processes (MDPs) are themselves dynamical systems, so it is no surprise that they can also be affected by delays. MDPs are the foundational framework of reinforcement learning (RL), a paradigm whose goal is to create artificial agents capable of learning to maximise their utility by interacting with their environment. RL has achieved strong, sometimes astonishing, empirical results, but delays are seldom explicitly accounted for, and the understanding of their impact on the MDP remains limited. In this dissertation, we study delays in the agent’s observation of the state of the environment or in the execution of the agent’s actions. We repeatedly change our point of view on the problem to reveal some of its structure and peculiarities. A wide spectrum of delays is considered, and potential solutions are presented. This dissertation also aims to draw links between celebrated frameworks of the RL literature and that of delays. To this end, we focus on the following four points of view.

First, we consider constant delays. Taking a psychology-inspired approach, we study the impact of predicting the near future in order to estimate the effect of the agent’s actions. We highlight how this approach relates to models from the RL literature. It is also the occasion to formally prove a seemingly evident fact: longer delays lead to lower performance. An experimental analysis concludes by showing the validity of the approach.

As a second point of view, we consider the simple approach of imitating the behaviour of an undelayed expert in the delayed environment. The delay remains constant at first, but we extend our study to more exotic types of delay, such as stochastic ones. Although the approach is simple, we demonstrate that it enjoys strong theoretical guarantees and empirical results.

Changing our point of view on constant delays for a third time, we consider adopting a non-stationary memoryless behaviour. Although it seemingly ignores the delay, the approach treats the delay’s effect as an unobserved variable that guides its non-stationarity. Building on this idea, we provide a theoretically grounded algorithm for learning such behaviour, which we test in realistic scenarios.

Finally, our last point of view considers a broader model that includes constant delays as a special case. This model allows actions to affect multiple future transitions of the environment. Its theoretical properties are examined to understand its specificities. Based on these properties, some RL algorithms are ruled out, while others are tested in various empirical studies. Because it is a more general model of delays, understanding it has implications for the constant-delay frameworks of the previous chapters.
