On the exploitation of uncertainty to improve Bellman updates and exploration in Reinforcement Learning

Sample efficiency has always been a concern in Reinforcement Learning (RL) research, and many works have been proposed to address it. The issue is well known to arise from the agent's need to explore its environment in order to improve its knowledge of it while simultaneously exploiting the actions it considers best in order to maximize its return, a trade-off known in RL as the exploration-exploitation dilemma. How well an algorithm handles this trade-off is central and is a measure of its effectiveness. Moreover, the recent rapid growth of RL research, made possible by a comparable improvement in computational power, has allowed researchers to extend RL methodologies to high-dimensional problems that were previously impractical, opening the line of research now commonly known as Deep Reinforcement Learning (DRL). However, the groundbreaking results of DRL come at the cost of a huge number of samples needed for learning, along with very long learning times, usually on the order of days. One reason for this, besides the significance of the results, which pushes the efficiency of these methodologies into the background, is that experiments are often run in simulation, where sample efficiency is less of an issue than in real applications. The purpose of this thesis is to study these problems by proposing novel methodologies that explicitly consider uncertainty to speed up learning and improve its stability. Indeed, since a relevant goal of an RL agent is to reduce uncertainty about the environment in which it moves, taking uncertainty explicitly into account is intuitively an effective way of acting.
This idea is not new in RL research, but much work remains to be done in this direction; this thesis takes inspiration from the available literature on the subject and extends it with novel improvements on the state of the art. In particular, the works included in this thesis can be grouped into two parts: one where uncertainty is used to improve the behavior of the Bellman equation, and another where it is used to improve exploration. The works in the first group address problems of action-value estimation in value-based RL, in particular the estimation of the maximum operator in the optimal Bellman equation and, more generally, the estimation of all its components. The works in the second group study methodologies to improve exploration, either through the use of Thompson Sampling in RL or by introducing a variant of the Bellman equation that incorporates an optimistic estimate of the action-value function, following the principle of Optimism in the Face of Uncertainty. All the works presented in this thesis are described, studied theoretically, and finally evaluated empirically on several RL problems. The results highlight the benefits of explicitly exploiting uncertainty in RL algorithms: on a large set of problems chosen to highlight particular aspects of interest, e.g. exploration capabilities, our methods prove to be more stable and faster to learn than others available in the literature.
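
The difficulty of estimating the maximum operator in the optimal Bellman equation can be illustrated with a small simulation. The sketch below is an illustrative assumption and is not taken from the thesis: every action has a true value of 0, returns are corrupted by unit-variance Gaussian noise, and all function names and parameters are hypothetical. It compares the plain maximum of sample means, which tends to overestimate the true maximum, with the double estimator known from the literature, which selects the best action on one half of the samples and evaluates it on the other half.

```python
import random


def sample_means(rng, n_actions, n_samples):
    """Sample mean of n_samples noisy returns for each action (true value 0)."""
    return [
        sum(rng.gauss(0.0, 1.0) for _ in range(n_samples)) / n_samples
        for _ in range(n_actions)
    ]


def max_estimator_bias(n_actions=10, n_samples=20, n_trials=2000, seed=0):
    """Average of max_a(sample mean); the true maximum is 0, so this is the bias."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_trials):
        total += max(sample_means(rng, n_actions, n_samples))
    return total / n_trials


def double_estimator_bias(n_actions=10, n_samples=20, n_trials=2000, seed=0):
    """Select the best action on one half of the data, evaluate it on the other."""
    rng = random.Random(seed)
    half = n_samples // 2
    total = 0.0
    for _ in range(n_trials):
        means_a = sample_means(rng, n_actions, half)  # used for selection
        means_b = sample_means(rng, n_actions, half)  # used for evaluation
        best = max(range(n_actions), key=lambda a: means_a[a])
        total += means_b[best]
    return total / n_trials


if __name__ == "__main__":
    print("maximum estimator bias:", max_estimator_bias())
    print("double estimator bias: ", double_estimator_bias())
```

In this particular setup, where all true action values are equal, the maximum estimator should report a clearly positive bias while the double estimator should stay close to zero; with unequal action values the double estimator can instead underestimate, which is one reason the choice of estimator matters.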

D'ERAMO, CARLO

Advisor: RESTELLI, MARCELLO
Coordinator: PERNICI, BARBARA
Tutor: BONARINI, ANDREA
Date: 18 Feb 2019
Document type: Doctoral thesis

Attached files

File	Size	Format
2019_02_PhD_DEramo.pdf (Open Access from 29/01/2020; description: thesis text)	2.35 MB	Adobe PDF

Documents in POLITesi are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/10589/144849