Safe policy iteration: a monotonically improving approximate policy iteration approach

PECORINO, ALESSIO; PIROTTA, MATTEO
2011/2012

Abstract

This thesis provides a study of the policy improvement step that can be usefully exploited by approximate policy–iteration algorithms. While exact policy iteration is guaranteed to improve the performance of the policy at each iteration, approximate policy iteration may generate a policy that is worse than the previous one. When either the policy evaluation step or the policy improvement step returns an approximate result, the sequence of policies produced by policy iteration may not be monotonically improving, and oscillations may occur. To address this issue, we consider safe policy improvements, i.e., a policy selection approach that guarantees an improvement of the performance at each iteration. The approach consists of computing the policy that maximizes a lower bound on the policy improvement w.r.t. the current policy. When no improving policy can be found, the algorithm stops. We initially focus on the theoretical analysis and derivation of a general lower bound on the policy improvement of a generic policy compared to another policy. Exploiting such a bound, we propose two safe policy–iteration algorithms that differ in the way the next policy is chosen w.r.t. the estimated greedy policy. The sample complexity and the performance guarantees of these methods are derived, showing that the algorithms achieve monotonically improving policy performance. Besides being theoretically derived and discussed, the proposed algorithms are empirically evaluated and compared with state–of–the–art approaches on a simple test domain. Safe policy iteration algorithms have been designed for discrete problems where state enumeration is not prohibitive; in fact, both the proposed approaches require enumerating the state space in order to compute the greedy policy. Nonetheless, this thesis gives significant theoretical and empirical contributions to the solution of the policy oscillation phenomenon.
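
To make the structure of such a safe update concrete, the following is a minimal Python sketch (not taken from the thesis): the convex combination of the current policy and the estimated greedy policy is the standard form of this family of conservative updates, but the quadratic shape of improvement_lower_bound and its arguments adv and penalty are illustrative placeholders, not the bound actually derived in the thesis.

import numpy as np

def greedy_policy(Q):
    # Deterministic greedy policy (one-hot per state) w.r.t. an action-value table Q[s, a].
    pi = np.zeros_like(Q)
    pi[np.arange(Q.shape[0]), Q.argmax(axis=1)] = 1.0
    return pi

def improvement_lower_bound(alpha, adv, penalty):
    # Placeholder lower bound on J(pi_new) - J(pi): a linear gain minus a quadratic
    # penalty in the step size alpha; the actual constants come from the bound
    # derived in the thesis and are not reproduced here.
    return alpha * adv - 0.5 * penalty * alpha ** 2

def safe_policy_step(pi, Q, adv, penalty):
    # One safe policy-improvement step: move towards the estimated greedy policy
    # by the step size that maximizes the lower bound; stop when no step with a
    # guaranteed improvement exists.
    pi_greedy = greedy_policy(Q)
    # Maximizer of the quadratic bound, clipped to a valid mixing coefficient in [0, 1].
    alpha = 1.0 if penalty <= 0 else float(np.clip(adv / penalty, 0.0, 1.0))
    if improvement_lower_bound(alpha, adv, penalty) <= 0:
        return pi, False  # no guaranteed improvement: the algorithm stops
    return alpha * pi_greedy + (1.0 - alpha) * pi, True

# Toy usage on a 2-state, 2-action problem with a rough advantage estimate.
Q = np.array([[1.0, 0.5], [0.2, 0.7]])
pi = np.full((2, 2), 0.5)
new_pi, improved = safe_policy_step(pi, Q, adv=0.3, penalty=1.5)
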
ING V - Scuola di Ingegneria dell'Informazione
4 October 2012
2011/2012
Master of Science thesis
Attached files
File: 2012_10_Pirotta_Pecorino.PDF
Description: Thesis text
Size: 2.09 MB
Format: Adobe PDF
Access: authorized users only from 25/09/2013

Documents in POLITesi are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/10589/67722