Active transfer of samples in reinforcement learning

In this work, we address the shortcomings of previous sample-transfer methods by handling the dissimilarity between tasks efficiently. Our contribution is an active algorithm, Active Weighted Fitted Q-Iteration (AWFQI), for transferring samples from the source tasks to the target task, under the assumption that a generative model of the target task is available. The algorithm actively requests the samples that are most informative for solving the target task. To this end, we estimate the conditional variance among the source tasks by applying Gaussian process regression to the collected samples. This variance captures the dissimilarities among tasks and is applicable in both discrete and continuous domains. In particular, we treat it as a score function indicating which state-action pairs are likely to be more informative for solving the target task, and we use it to actively query the generative model. We compare our method with state-of-the-art sample-transfer algorithms such as Relevance-based Transfer, Shared-dynamics Transfer, and Importance Weighted Fitted Q-Iteration, and with a non-transfer algorithm (Fitted Q-Iteration). We show that our method outperforms these algorithms in many situations, particularly when the transfer budget is limited.

In this work, we try to overcome some of the major limitations of existing work on sample transfer in reinforcement learning. The main contribution is the design and empirical evaluation of a new algorithm, Active Weighted Fitted Q-Iteration (AWFQI), for transferring samples from source tasks to a target task while efficiently handling the dissimilarities between tasks. The proposed algorithm actively requests samples from the target task (that is, it requests the execution of certain actions in arbitrary states), under the assumption that a generative model of the target task is available. These samples are selected so as to carry a large amount of information for solving the task itself. To this end, we propose estimating the conditional variance of the transition and reward models among the source tasks, using Gaussian processes to perform regression on the observed samples. This variance captures the dissimilarities between tasks and can be computed in both discrete and continuous domains. In our case, it is treated as a function indicating which state-action pairs are most informative for solving the target task and, consequently, it is used to actively request samples from the generative model. Finally, we empirically evaluate our algorithm in a continuous domain, comparing it with state-of-the-art approaches such as Relevance-based Transfer, Shared-dynamics Transfer, and Importance Weighted Fitted Q-Iteration, and with an algorithm that performs no knowledge transfer (Fitted Q-Iteration). We show that our algorithm often achieves the best performance, particularly when the budget of samples that can be requested from the target task, or that is available from the source tasks, is limited.
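
The variance-based score described above can be illustrated with a minimal Python sketch. It is not the AWFQI implementation: it covers only the reward model, it assumes one plausible reading of the "conditional variance among the source tasks" (the variance, across source tasks, of per-task Gaussian process predictions at a given state-action pair), and every name in it (`rbf_kernel`, `gp_posterior_mean`, `dissimilarity_score`, `select_queries`, `make_toy_sources`), as well as the kernel, its length scale, the noise level, and the toy tasks, is a hypothetical choice made for illustration.

```python
import numpy as np


def rbf_kernel(a, b, length_scale=0.5):
    """Squared-exponential kernel between two sets of row vectors."""
    sq_dist = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return np.exp(-0.5 * sq_dist / length_scale ** 2)


def gp_posterior_mean(x_train, y_train, x_query, noise=1e-2):
    """Posterior mean of a zero-mean GP regression at the query points."""
    k_train = rbf_kernel(x_train, x_train) + noise * np.eye(len(x_train))
    k_cross = rbf_kernel(x_query, x_train)
    return k_cross @ np.linalg.solve(k_train, y_train)


def dissimilarity_score(source_datasets, candidates):
    """Variance across source tasks of the per-task GP reward predictions.

    Assumption: this stands in for the conditional variance used as a
    score function; the original estimator may differ.
    """
    means = np.stack(
        [gp_posterior_mean(x, y, candidates) for x, y in source_datasets]
    )
    return means.var(axis=0)


def select_queries(source_datasets, candidates, budget):
    """Pick the `budget` candidate state-action pairs with the highest score."""
    scores = dissimilarity_score(source_datasets, candidates)
    top = np.argsort(scores)[::-1][:budget]
    return candidates[top], scores[top]


def make_toy_sources(n_samples=40, seed=0):
    """Hypothetical source tasks: rewards agree for state < 0.5, diverge above."""
    rng = np.random.default_rng(seed)
    datasets = []
    for shift in (-2.0, 0.0, 2.0):
        x = rng.uniform(0.0, 1.0, size=(n_samples, 2))  # columns: state, action
        y = np.sin(3.0 * x[:, 0]) + shift * np.maximum(0.0, x[:, 0] - 0.5)
        datasets.append((x, y))
    return datasets


if __name__ == "__main__":
    grid = np.linspace(0.0, 1.0, 11)
    candidates = np.array([(s, a) for s in grid for a in grid])
    queries, scores = select_queries(make_toy_sources(), candidates, budget=5)
    # These pairs would be sent to the target task's generative model.
    print(queries)
```

In this toy setting the source tasks disagree only for states above 0.5, so the selected queries concentrate in that region, which is where samples from the target task would be expected to carry the most information.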