Exploration in policy search via multiple importance sampling

Reinforcement Learning (RL) is an area of Machine Learning that aims at building autonomous agents capable of learning to solve sequential decision problems, such as automation control or robotics tasks. The learning process consists in a sequence of interactions between the agent and the environment, in the presence of a quantitative reward. The goal of the agent is to maximize a performance metric by learning a suitable policy (strategy). This objective entails a critical aspect of the learning process: the balance between the exploration of the environment, which enables the discovery of new profitable actions, and the exploitation of the most rewarding actions already learned. Reinforcement Learning (RL) is a Machine Learning area that aims at building autonomous agents capable of learning to solve sequential decision problems, such as automation control or robotics tasks. The learning process consists of a sequence of interactions between the agent and the environment, in the presence of a quantitative reward. The goal of the agent is to maximize a performance metric by learning a suitable policy (strategy). This objective entails a critical aspect of the learning process: the balance between the exploration of the environment, which enables the discovery of new profitable actions, and the exploitation of the most rewarding actions already learned. In this thesis, we address the exploration-exploitation trade-off in Policy Search (PS), an effective approach to RL for solving control tasks with continuous state-action spaces. PS explicitly models policies as stochastic parametric functions and directly optimizes performance against policy parameters. We design an innovative formulation of the Policy Search problem as a suitable Multi Armed Bandit (MAB) problem. The MAB problem is equivalent to an RL problem in which the environment has a single state and the actions available to the agent are called arms. Such framework readily lends itself to the study of the exploration-exploitation trade-off because of its simplicity. In our formulation, the arm set is the policy parameter space. This allows us to easily transfer some theoretically sound methods of the MAB literature to the PS setting. We propose a novel class of algorithms that effectively explore the parameter space, by leveraging Multiple Importance Sampling to perform an estimation of performance. We also provide theoretical guarantees on their regret \wrt the optimal policy. Specifically, we prove that the regret is sub-linear for both discrete and continuous parameter spaces. Finally, we evaluate our algorithms on tasks of varying difficulty, comparing them with existing MAB and RL algorithms.

Questa tesi si rivolge al problema di esplorazione dello spazio dei parametri che si presenta nei metodi di Policy Search (PS), una branca dell’apprendimento per rinforzo. Questa classe di algoritmi modella esplicitamente la politica di un agente, ovvero il suo comportamento rispetto all’ambiente circostante, come una funzione parametrica dello spazio degli stati, solitamente stocastica, e ha come obbiettivo l’ottimizzazione di una funzione utilità rispetto ai parametri stessi. La funzione utilità è spesso multi modale (con numerosi ottimi locali), per cui l’agente rischia di arrestarsi ad una politica sub-ottimale. Un’ efficacie esplorazione dello spazio dei parametri mitiga questo problema, o lo risolve del tutto. Prendendo spunto dalla letteratura sui Multi Armed Bandit (MAB), riformuliamo in maniera innovativa il problema di PS come un particolare problema MAB, il sui insieme dei bracci è lo spazio dei parametri della politica. Disegniamo poi una nuova classe di algoritmi che sfruttano tecniche di Multiple Importance Sampling per esplorare efficacemente questo spazio, fornendo opportuni risultati teorici. In particolare, proviamo che il regret (il differenziale tra la performance della politica dell’agente e quella ottima, cumulato) dei nostri algoritmi è sub-lineare. Infine, valutiamo sperimentalmente i nostri algoritmi su una serie di problemi di controllo continuo, attraverso simulazioni numeriche, comparandoli ad algoritmi esistenti nella letterature MAB e PS.