Adaptive optimization of hyper-parameters and reward function via evolutionary algorithm applied on a deep reinforcement learning robotic grasping task

Applications of Deep Reinforcement Learning are growing thanks to its ability to teach an agent a task autonomously. The strength of Deep Reinforcement Learning is its ability to generalize what is learnt, so that the agent can adapt its behavior to different conditions. However, this comes at the cost of a huge number of samples and interactions with the environment, particularly for continuous control problems. Moreover, the autonomy of learning depends on a user-defined set of hyper-parameters and on a tedious design of the reward function, which guides the agent towards the target task. In this work, a single-agent Deep Reinforcement Learning robotic grasping task is the baseline: the robot has to reach the object, grasp it and lift it off the surface of the table. The optimal policy is learnt with an on-policy algorithm, Proximal Policy Optimization (PPO). The task is taught entirely in simulation using the MuJoCo environment, with a Franka Emika Panda manipulator as the agent. On top of this framework, an Evolutionary Algorithm (EA) approach, implemented in code written by the author, is used to evaluate the possibility of reducing the number of samples needed to teach the robot the task. Moreover, the approach has proven able to create reward functions adaptively, lightening the user's task and opening the door to interesting possibilities. To accomplish this, a parallel environment setting is created. More specifically, each agent is an individual of a population and is encoded as a chromosome. Its gene sequence is a set of hyper-parameters and/or reward function parameters. These parameters are changed during training, depending on the performance of the entire population. Thanks to this setting, learning can be transferred between individuals, allowing the best solutions to emerge.
The parallel setting is a consequence of the proposed approach: it is an important added value that brings several benefits, which are listed in the report. The results of this work show an increase in the training efficiency of the Reinforcement Learning algorithm and demonstrate that an adaptive reward function can be created. In particular, learning improves with respect to the baseline case: the agent needs 67% fewer episodes to learn the task completely, thanks to the adaptive transfer learning provided by the evolutionary algorithm.
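The chromosome encoding described above can be illustrated with a minimal sketch. It is not the author's implementation: the gene names, their bounds, the elitist selection scheme, the uniform crossover, the Gaussian mutation and the `placeholder_fitness` function are all assumptions chosen for illustration. In the actual setting, the fitness of an individual would come from training a PPO agent in its own MuJoCo environment, and transferring the learning would presumably also involve the policy of the best individuals, which is omitted here.

```python
import random
from dataclasses import dataclass

# Hypothetical genes: two PPO hyper-parameters and three reward weights
# (reach, grasp, lift). Names and bounds are illustrative assumptions.
GENE_BOUNDS = {
    "learning_rate": (1e-5, 1e-3),
    "clip_range": (0.1, 0.3),
    "w_reach": (0.0, 1.0),
    "w_grasp": (0.0, 1.0),
    "w_lift": (0.0, 1.0),
}


@dataclass
class Individual:
    genes: dict           # the chromosome: gene name -> value
    fitness: float = 0.0  # performance measured after a training phase


def random_individual(rng):
    """Sample each gene uniformly inside its bounds."""
    return Individual({k: rng.uniform(lo, hi) for k, (lo, hi) in GENE_BOUNDS.items()})


def crossover(genes_a, genes_b, rng):
    """Uniform crossover: each gene is inherited from one of the two parents."""
    return {k: (genes_a[k] if rng.random() < 0.5 else genes_b[k]) for k in genes_a}


def mutate(genes, rng, rate=0.2, scale=0.1):
    """Perturb each gene with probability `rate`, clipping to its bounds."""
    child = dict(genes)
    for k, (lo, hi) in GENE_BOUNDS.items():
        if rng.random() < rate:
            noisy = child[k] + rng.gauss(0.0, scale * (hi - lo))
            child[k] = min(hi, max(lo, noisy))
    return child


def evolve(population, rng, elite_fraction=0.25):
    """Keep the best individuals and refill the population with their offspring."""
    ranked = sorted(population, key=lambda ind: ind.fitness, reverse=True)
    n_elite = max(1, int(len(ranked) * elite_fraction))
    elites = ranked[:n_elite]
    next_gen = [Individual(dict(e.genes), e.fitness) for e in elites]
    while len(next_gen) < len(population):
        if len(elites) > 1:
            parent_a, parent_b = rng.sample(elites, 2)
        else:
            parent_a = parent_b = elites[0]
        child_genes = mutate(crossover(parent_a.genes, parent_b.genes, rng), rng)
        next_gen.append(Individual(child_genes))
    return next_gen


def placeholder_fitness(genes):
    """Stand-in for 'train the agent, then measure its success'.

    Purely synthetic: it rewards genes close to the middle of their range.
    """
    score = 0.0
    for k, (lo, hi) in GENE_BOUNDS.items():
        normalized = (genes[k] - lo) / (hi - lo)
        score -= (normalized - 0.5) ** 2
    return score


def run(generations=5, population_size=8, seed=0):
    """Alternate evaluation and evolution; return the best individual found."""
    rng = random.Random(seed)
    population = [random_individual(rng) for _ in range(population_size)]
    for _ in range(generations):
        for ind in population:
            ind.fitness = placeholder_fitness(ind.genes)
        population = evolve(population, rng)
    for ind in population:
        ind.fitness = placeholder_fitness(ind.genes)
    return max(population, key=lambda ind: ind.fitness)
```

Calling `run()` returns the best individual after a few generations. Because the elites are carried over unchanged and the placeholder fitness is deterministic, the best fitness in this sketch never decreases from one generation to the next; with a noisy fitness such as the outcome of a training run, that guarantee would no longer hold.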

Learning applications are growing thanks to their ability to teach an agent a task autonomously. The strength of Deep Reinforcement Learning is its ability to generalize what is learnt, so that the agent can adapt its behavior to different conditions. However, this comes at the cost of a huge number of interactions with the environment, especially in continuous control problems. Moreover, the autonomy of learning depends on a set of hyper-parameters and on a reward function, both defined by the user, which guide the agent towards completing the task. In this thesis, a single-agent grasping operation learnt via Deep Reinforcement Learning is taken as the baseline: the robot has to reach the object, grasp it and lift it off the surface of the table. The optimal policy is learnt using the on-policy algorithm Proximal Policy Optimization (PPO). The task is taught entirely in simulation using the MuJoCo environment, with a Franka Emika Panda manipulator as the agent. In this work, the evolutionary algorithm approach is applied through code implemented by the author in order to evaluate the possibility of optimizing learning efficiency by reducing the number of samples needed to teach the robot the task. Moreover, the approach has proven able to create reward functions adaptively, lightening the user's task and opening the door to interesting possibilities. To do this, several environments are run in parallel. More specifically, each agent is an individual of a population, encoded as a chromosome. Its gene sequence is a set of hyper-parameters and/or reward function parameters. These parameters are modified during training, depending on the performance of each individual of the population.
Thanks to this setting, learning can be transferred, allowing the best solutions to emerge. The creation of the parallel environments is a consequence of the proposed approach: it is an important added value that brings several benefits, presented in the report. The results of this work show an increase in the training efficiency of the reinforcement learning algorithm and demonstrate that an adaptive reward function can be created. In particular, efficiency improves with respect to the baseline case, since the agent needs 67% fewer episodes to learn the task completely thanks to the proposed evolutionary algorithm.