Safe policy iteration: a monotonically improving approximate policy iteration approach

PECORINO, ALESSIO; PIROTTA, MATTEO
2011/2012

Abstract

This thesis provides a study of the policy improvement step that can be usefully exploited by approximate policy–iteration algorithms. While exact policy iteration is guaranteed to improve the performance of the policy at each iteration, approximate policy iteration may generate a policy that is worse than the previous one. When either the policy evaluation step or the policy improvement step returns an approximate result, the sequence of policies produced by policy iteration may not be monotonically improving, and oscillations may occur. To address this issue, we consider safe policy improvements, i.e., a policy selection approach that guarantees an improvement of the performance at each iteration. The approach consists of computing the policy that maximizes a lower bound on the policy improvement w.r.t. the current policy. When no improving policy can be found, the algorithm stops. We initially focus on the theoretical analysis and derivation of a general lower bound on the policy improvement of a generic policy compared to another policy. Exploiting such a bound, we propose two safe policy–iteration algorithms that differ in the way the next policy is chosen w.r.t. the estimated greedy policy. The sample complexity and the performance guarantees of these methods are derived, showing that the algorithms achieve monotonically improving policy performance. Besides being theoretically derived and discussed, the proposed algorithms are empirically evaluated and compared with state–of–the–art approaches on a simple test domain. Safe policy iteration algorithms have been designed for discrete problems where state enumeration is not prohibitive; in fact, both the proposed approaches require enumerating the state space in order to compute the greedy policy. Nonetheless, this thesis gives significant theoretical and empirical contributions to the solution of the policy oscillation phenomenon.
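
To make the structure of such a safe update concrete, the following is a minimal Python sketch (not taken from the thesis): the convex combination of the current policy and the estimated greedy policy is the standard form of this family of conservative updates, but the quadratic shape of improvement_lower_bound and its arguments adv and penalty are illustrative placeholders, not the bound actually derived in the thesis.

import numpy as np

def greedy_policy(Q):
    # Deterministic greedy policy (one-hot per state) w.r.t. an action-value table Q[s, a].
    pi = np.zeros_like(Q)
    pi[np.arange(Q.shape[0]), Q.argmax(axis=1)] = 1.0
    return pi

def improvement_lower_bound(alpha, adv, penalty):
    # Placeholder lower bound on J(pi_new) - J(pi): a linear gain minus a quadratic
    # penalty in the step size alpha; the actual constants come from the bound
    # derived in the thesis and are not reproduced here.
    return alpha * adv - 0.5 * penalty * alpha ** 2

def safe_policy_step(pi, Q, adv, penalty):
    # One safe policy-improvement step: move towards the estimated greedy policy
    # by the step size that maximizes the lower bound; stop when no step with a
    # guaranteed improvement exists.
    pi_greedy = greedy_policy(Q)
    # Maximizer of the quadratic bound, clipped to a valid mixing coefficient in [0, 1].
    alpha = 1.0 if penalty <= 0 else float(np.clip(adv / penalty, 0.0, 1.0))
    if improvement_lower_bound(alpha, adv, penalty) <= 0:
        return pi, False  # no guaranteed improvement: the algorithm stops
    return alpha * pi_greedy + (1.0 - alpha) * pi, True

# Toy usage on a 2-state, 2-action problem with a rough advantage estimate.
Q = np.array([[1.0, 0.5], [0.2, 0.7]])
pi = np.full((2, 2), 0.5)
new_pi, improved = safe_policy_step(pi, Q, adv=0.3, penalty=1.5)
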
ING V - Scuola di Ingegneria dell'Informazione
4 October 2012
2011/2012
Master of Science thesis
Attached files
File: 2012_10_Pirotta_Pecorino.PDF
Description: Thesis text
Size: 2.09 MB
Format: Adobe PDF
Access: authorized users only from 25/09/2013

Documents in POLITesi are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/10589/67722