This thesis deals with Reinforcement Learning, a branch of Machine Learning concerned with the sequential decision-making problem: developing an agent capable of taking the right actions in a given environment. To guide the agent, it is assigned a reward, designed with the behaviour we want the agent to learn in mind, so that the RL problem as a whole is mathematically framed as the optimization of the mean of the sum of these rewards, the so-called return. In this thesis we study a line of research within RL that tries to incorporate risk into the optimization objective, so that the agent can learn different behaviours depending on the user's level of risk propensity. This is useful in highly stochastic environments, such as financial ones. Risk is usually incorporated directly into the optimization objective, which becomes some form of risk measure instead of the mean. The problem with this approach is that it drastically changes the mathematical form of the problem and, as a consequence, the proposed algorithms are very different from the classical ones of the risk-neutral line of research: every advance made in the latter field therefore does not find an easy application in the risk-averse one. This thesis is among the attempts to remedy this schism: in particular, we take inspiration from ROSA, an algorithm recently presented in ["MDP Transformation for Risk Averse Reinforcement Learning via State Space Augmentation", 2020]. ROSA proposes a way to make policy gradient algorithms, a powerful class of reinforcement learning algorithms, suitable for the optimization of some of the most popular risk measures, while preserving their advantages, through a transformation of the returns. The aim of this thesis was to find an approach with the same purpose applied, however, to value-based algorithms, another widely used class. We reach this goal, provided that we use an alternative risk measure, the volatility, recently introduced in ["Risk-Averse Trust Region Optimization", 2019], which proved more congenial to our application. We use our approach to propose versions of Fitted Q-value Iteration (FQI) and Deep Q-Network (DQN), two of the most widely used value-based algorithms, suited to the optimization of the mean-volatility, and we show their application on environments designed to measure their ability to capture risk aversion.
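For reference, the objectives discussed above can be written as follows. This is only a sketch: it follows the standard risk-neutral formulation and the mean-volatility definition given in the cited ["Risk-Averse Trust Region Optimization", 2019]; the symbols J_\pi, \nu^2_\pi and \lambda are our notation and not necessarily the one used in the thesis.

\[
J_\pi \;=\; (1-\gamma)\,\mathbb{E}_\pi\!\left[\sum_{t=0}^{\infty}\gamma^t r_t\right]
\;=\; \mathbb{E}_{(s,a)\sim d_{\mu,\pi}}\!\left[r(s,a)\right]
\qquad\text{(risk-neutral objective, normalized)}
\]
\[
\nu^2_\pi \;=\; \mathbb{E}_{(s,a)\sim d_{\mu,\pi}}\!\left[\bigl(r(s,a)-J_\pi\bigr)^2\right]
\qquad\text{(reward volatility)}
\]
\[
\eta_\pi \;=\; J_\pi-\lambda\,\nu^2_\pi
\;=\; \mathbb{E}_{(s,a)\sim d_{\mu,\pi}}\!\left[\,r(s,a)-\lambda\bigl(r(s,a)-J_\pi\bigr)^2\right]
\qquad\text{(mean-volatility, }\lambda\ge 0\text{)}
\]

The last identity is what makes a reward-transformation approach natural: maximizing the mean of the transformed per-step reward r(s,a) - \lambda (r(s,a) - J_\pi)^2 corresponds to maximizing the mean-volatility objective, up to the fact that J_\pi itself depends on the policy and must be estimated.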
Risk-averse reinforcement learning through reward transformation
SAINI, CHARLIE
2019/2020
Abstract
This thesis is concerned with Reinforcement Learning, a subfield of Machine Learning which studies how to develop an autonomous agent able to take the right decisions in a given environment. This is accomplished by training a model in an environment to take the actions that maximize the mean of the return, the sum of the rewards of these actions, where the rewards are designed to encode the behaviour we want the agent to attain. Specifically, we deal with risk-averse Reinforcement Learning, which adds the difficulty of incorporating some form of risk aversion into the agent. This is useful when we face an environment with a certain degree of stochasticity, so that we want to be able to tell the agent how inclined we are to take a risk for the promise of a greater, yet uncertain, return. This is often accomplished by optimizing, instead of the mean of the returns, some form of risk measure. However, this changes the mathematical nature of the problem and has caused the proposed algorithms to be remarkably different from those developed in the main line of risk-neutral reinforcement learning. The drawback is that, whenever new progress is made in risk-neutral RL, there is no readily available way of translating that advancement to the risk-averse setting, which is therefore forced to chase after it. The authors of ["MDP Transformation for Risk Averse Reinforcement Learning via State Space Augmentation", 2020] tackled this issue and proposed a framework, ROSA, to extend policy gradient algorithms, a powerful class of reinforcement learning algorithms, to the optimization of some popular risk measures such as mean-variance and the Conditional Value at Risk. The aim of this thesis was to find a similar approach for value-based algorithms, another equally important class. We do find such an approach, which, however, works for an alternative risk measure, the volatility, recently introduced in ["Risk-Averse Trust Region Optimization", 2019], which proved to be better suited to our setting. We finally test it by deploying risk-averse versions of Fitted Q-value Iteration and Deep Q-Network, two state-of-the-art value-based algorithms, in two custom environments, and we show their limitations and advantages.
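As an illustration of how a transformed reward could be plugged into a standard value-based loop, here is a minimal, hypothetical sketch of a batch FQI iteration run on mean-volatility-transformed rewards. The transformation r_tilde = r - lam * (r - J_hat)^2, the crude batch estimate of J_hat, and all function and variable names below are illustrative assumptions, not the thesis's actual implementation.

import numpy as np
from sklearn.ensemble import ExtraTreesRegressor


def mean_volatility_fqi(transitions, n_actions, gamma=0.99, lam=0.5, n_iterations=50):
    """Batch FQI on rewards transformed as r - lam * (r - J_hat)^2 (illustrative sketch)."""
    # Unpack a batch of (state, action, reward, next_state, done) transitions;
    # states are 1-D feature vectors, actions are integers in {0, ..., n_actions - 1}.
    states = np.array([t[0] for t in transitions], dtype=float)
    actions = np.array([t[1] for t in transitions], dtype=float)
    rewards = np.array([t[2] for t in transitions], dtype=float)
    next_states = np.array([t[3] for t in transitions], dtype=float)
    dones = np.array([t[4] for t in transitions], dtype=float)

    # Crude estimate of the normalized expected return J used in the transformation:
    # here simply the empirical mean reward of the batch.
    j_hat = rewards.mean()

    # Transformed per-step reward: maximizing its mean trades expected reward
    # against the volatility of the per-step reward, weighted by lam.
    rewards_tilde = rewards - lam * (rewards - j_hat) ** 2

    features = np.column_stack([states, actions])
    q_model = None
    for _ in range(n_iterations):
        if q_model is None:
            targets = rewards_tilde
        else:
            # Bootstrap with the max over actions of the current Q estimate at the next state.
            q_next = np.column_stack([
                q_model.predict(np.column_stack([next_states,
                                                 np.full(len(next_states), a, dtype=float)]))
                for a in range(n_actions)
            ]).max(axis=1)
            targets = rewards_tilde + gamma * (1.0 - dones) * q_next
        # Refit the Q-function regressor on the updated targets.
        q_model = ExtraTreesRegressor(n_estimators=50, random_state=0).fit(features, targets)
    return q_model

In a DQN-style variant, the same transformation could in principle be applied to the rewards sampled from the replay buffer before computing the temporal-difference targets.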
Charlie_Saini_Thesis.pdf | 1.44 MB | Adobe PDF | openly accessible online
Documents in POLITesi are protected by copyright and all rights are reserved, unless otherwise indicated.
https://hdl.handle.net/10589/175410