Balancing safety and exploration in policy gradient

POLITesi - Digital archive of degree and doctoral theses

BATTISTELLO, ANDREA

2017/2018

Abstract

Reinforcement Learning is a powerful framework that can be used to solve complex control tasks. Among the current challenges that reinforcement learning faces are the inherent difficulties of exploring the environment and of doing so safely. Safe Reinforcement Learning is necessary for critical applications, such as robotics, where exploratory behaviours can harm systems and people, but it also lends itself to economic interpretations. However, safe algorithms often tend to be overly conservative, sacrificing too much in terms of learning speed and exploration. The latter, in particular, is of fundamental importance for a learning algorithm to gain information about the environment. In this thesis, we investigate the non-trivial tradeoff between these two competing aspects, safety and exploration. Starting from the idea that a practical algorithm should be as safe as needed, but no more, we identify interesting application scenarios and propose Safely-Exploring Policy Gradient (SEPG), a very general policy gradient framework that can be customized to match particular safety constraints. To do so, we generalize existing bounds on performance improvement for Gaussian policies to the adaptive-variance case and propose policy updates that are both safe and exploratory.
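As an illustration of what "Gaussian policies" with adaptive variance means in a policy gradient setting, the following minimal sketch computes the log-density and score function of a one-dimensional Gaussian policy. The abstract does not state the parameterization, so everything here is an assumption made for illustration: the mean is taken to be linear in hypothetical state features `phi`, the standard deviation is `exp(w)` with a learnable `w`, and the function names are invented. It is not the SEPG update and it contains no safety bound.

```python
import math


def log_prob(action, phi, theta, w):
    """Log-density of a Gaussian policy with mean theta . phi and std exp(w).

    Assumed parameterization (not taken from the thesis): linear mean,
    log-standard-deviation parameter w.
    """
    mean = sum(t * f for t, f in zip(theta, phi))
    z = (action - mean) / math.exp(w)
    # log N(a | mean, sigma^2) with log(sigma) = w
    return -0.5 * z * z - w - 0.5 * math.log(2.0 * math.pi)


def score(action, phi, theta, w):
    """Gradient of log_prob with respect to the mean weights and to w."""
    mean = sum(t * f for t, f in zip(theta, phi))
    sigma2 = math.exp(2.0 * w)
    delta = action - mean
    grad_theta = [delta * f / sigma2 for f in phi]  # d/d theta
    grad_w = delta * delta / sigma2 - 1.0           # d/d w
    return grad_theta, grad_w


if __name__ == "__main__":
    # Hypothetical numbers, chosen only to exercise the functions.
    g_theta, g_w = score(0.7, [1.0, 0.5], [0.2, -0.3], -0.4)
    print(g_theta, g_w)
```

In a policy gradient method these score terms are weighted by returns to estimate the gradient of expected performance; adapting `w` as well as `theta` is what lets the amount of exploration change during learning.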

Advisor	RESTELLI, MARCELLO
Co-advisor(s)	PAPINI, MATTEO
School / Dept.	ING - School of Industrial and Information Engineering
Date	25 July 2018
Academic year	2017/2018
Italian abstract (English translation)	Reinforcement learning is a powerful framework that can be used to solve complex control problems. Among the current challenges in this field is the problem of how to explore the environment and how to do so safely. Safe reinforcement learning is necessary for critical applications, such as robotics, where exploratory behaviours can damage machinery and harm people, but it also lends itself to economic interpretations. However, safe learning algorithms often tend to be overly conservative, sacrificing too much in terms of learning speed and exploration. The latter, in particular, is of fundamental importance for obtaining information about the environment. In this thesis, we consider the non-trivial tradeoff between these two conflicting aspects, safety and exploration. Starting from the idea that a practical algorithm should be as safe as needed, but no more, we identified interesting application scenarios within a new framework that we called Safely-Exploring Policy Gradient (SEPG). SEPG is a general policy gradient algorithm that can be customized to satisfy particular safety constraints. To do so, we generalized existing results in safe reinforcement learning with Gaussian policies to the case of adaptive exploration.
Document type	Master's thesis
Appears in types	Master's thesis

Attached files

File	Description	Access	Size	Format
Tesi_battistello_v1.0.pdf	Corrected thesis	Openly accessible online	1.12 MB	Adobe PDF

Documents in POLITesi are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/10589/141790