Time-variant variational transfer for value functions
Soprani, Andrea
2019/2020
Abstract
In the vast landscape of Artificial Intelligence techniques, the Reinforcement Learning (RL) framework stands out for its effectiveness in modelling sequential decision making. In this setting, an agent must learn to act in an environment by iteratively executing an action and observing its outcome; many RL algorithms have been developed for this purpose. In this thesis, we consider only a subset of these algorithms, namely value-based methods. These algorithms aim to learn a value function that models the quality of each available action; the agent can then act effectively in the environment by choosing the action that maximizes this function. This is a powerful technique, since any approximator can serve as the value function; a popular choice is Deep Neural Networks. RL algorithms are very effective at learning a new task from the ground up, but they become inefficient when there is a set of similar tasks to solve, since they are not designed to reuse the knowledge acquired in solving previous tasks. This problem is usually addressed with Transfer Learning techniques, which speed up the learning process on a new task by exploiting the experience acquired on previously solved ones. Most of these approaches assume that the distribution over tasks is stationary, i.e., that it does not change over time; therefore, all tasks (both the sources and the target) are i.i.d. samples from the same distribution. In this work, we consider the problem of transferring value functions through a variational method when the distribution that generates the tasks is time-variant, and we propose a solution that leverages the temporal structure inherent in the task-generating process. This setting is highly relevant because, in real-world applications, tasks generated by a time-variant distribution are very common; examples abound in finance, robotics, energy production, and many other fields. Our method uses a kernel density estimator in which the spatial and temporal components are separated. This density estimator serves as the prior in a variational inference approach. We provide a convergence theorem for this estimator and, by means of a finite-sample analysis, compare our solution to its time-invariant version. Finally, we present an experimental evaluation of the proposed technique with three distinct temporal dynamics in three different RL environments.
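As an informal illustration of the value-based setting described above, the following minimal sketch shows how an agent acts greedily with respect to a learned action-value function; the tabular representation and the names used here are assumptions for the example only.

```python
import numpy as np

# Minimal sketch of value-based action selection, assuming a discrete action
# space and a tabular action-value function Q (n_states x n_actions); in
# practice Q could be any approximator, e.g. a deep neural network.

def greedy_action(q_values, state):
    """Return the action with the highest estimated value in `state`."""
    return int(np.argmax(q_values[state]))

# Example: 3 states, 2 actions.
Q = np.array([[0.1, 0.4],
              [0.7, 0.2],
              [0.0, 0.9]])
print(greedy_action(Q, state=1))  # prints 0
```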
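To give an intuition for the transfer mechanism, the following is a rough sketch of a kernel density estimator with separated spatial and temporal components, of the kind the abstract refers to; the Gaussian kernels, the Nadaraya-Watson-style temporal weighting, and all names and bandwidths are illustrative assumptions, not the thesis's exact construction.

```python
import numpy as np

# Illustrative sketch only: a conditional kernel density estimate in which the
# temporal component (a kernel over task times, used as normalized weights) is
# separated from the spatial component (a kernel over value-function
# parameters). Gaussian kernels and the bandwidths h_space, h_time are
# assumptions for this example.

def gaussian_kernel(u):
    """Standard Gaussian kernel, applied element-wise."""
    return np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)

def time_variant_kde(theta, t, thetas_src, times_src, h_space=0.5, h_time=1.0):
    """Estimate the density of parameters `theta` at query time `t` from
    source parameters `thetas_src` (n_src x d) observed at `times_src` (n_src,)."""
    # Temporal weights: sources observed closer to the query time count more.
    w = gaussian_kernel((t - times_src) / h_time)
    w = w / w.sum()
    # Spatial kernel: product of per-dimension Gaussian kernels on the parameters.
    d = thetas_src.shape[1]
    k_space = np.prod(gaussian_kernel((theta - thetas_src) / h_space), axis=1) / h_space ** d
    # Temporally weighted mixture of spatial kernels centred on the sources.
    return float(np.sum(w * k_space))

# Example: four 1-D "source value-function parameters" drifting over time.
thetas_src = np.array([[0.0], [0.2], [1.0], [1.1]])
times_src = np.array([1.0, 2.0, 3.0, 4.0])
print(time_variant_kde(np.array([1.05]), t=4.5, thetas_src=thetas_src, times_src=times_src))
```

In a variational transfer scheme, an estimate of this kind over previously learned value-function parameters could then play the role of the prior when fitting the target task.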
Documents in POLITesi are protected by copyright and all rights are reserved, unless otherwise indicated.
https://hdl.handle.net/10589/164518