Decentralized and multi-agent control of Franka Emika Panda robot in continuous task execution
Pecioski, Damjan; Vidal Sesin, Jorge Said
2020/2021
Abstract
The use of tools to make everyday human life easier is nothing new. Since we started using simple rocks as aids to make fire, our evolution has brought us to the point of using machines that can perform a task for us repetitively, without tiring, replacing some of our "human errors" with consistency and efficiency. Having said this, the machines we use regularly are generally described, in a word, as "dumb" machines: they need to be programmed and shown what to do, and they simply repeat or imitate that path or behavior. The need for smarter machines that can handle unpredictable situations and (partially) unknown tasks is becoming more and more pressing in industry, and will soon reach our homes and daily lives. Robotics is perhaps one of the most powerful realizations of automation, and the demand for robots that can perform well in unpredictable and uncertain situations is increasing. The answer to this problem is machine learning. Applying it to robotics brings additional challenges, since the input to the controller is often composed of partial and noisy information, while the output is a complex, time-dependent set of signals. For this reason, among others, we turn to reinforcement learning (RL), in which the robot learns new and adaptive control policies through interaction with the environment. Instead of giving the robot a large amount of data describing a "well-behaving" robot, we give it a task setpoint or goal, together with a reward when the desired behavior is performed and a punishment when an undesired behavior occurs. In this way, over several iterations of the same problem, the robot learns which actions are good and which are undesirable, helped by the fact that experience is stored in a buffer and later reused to accelerate the learning process. However, this RL approach requires a large number of samples, particularly for continuous control problems, and can become cumbersome to handle even for the fastest computers available today.
In this work we present an RL approach to learn continuous control actions for a Franka Emika Panda robot (the joints of the 7-DOF arm as well as the gripper, which is simulated with a single joint moving both gripper fingers) through the simplification and use of a decentralized controller. The learning is done using two model-free RL algorithms, Proximal Policy Optimization (PPO) and Soft Actor-Critic (SAC). This research aims to show that control of the robot can be decentralized, with multiple agents learning to perform the task together while using smaller neural networks. The algorithm is designed with a reward function that reinforces positive (desired) behavior through sparse rewards, enabling an efficient learning process for the agents while incentivizing exploration. The reward function is split into multiple sparse components: positioning the end effector at the cube, successfully grasping the cube, successfully lifting the cube, avoiding any obstacles in the way (here a negative reward is given as punishment), keeping the control actions smooth, and staying inside the action limits for the robot torques.
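As a rough illustration of how such a composite sparse reward can be assembled, the sketch below combines the individual terms named above into a single scalar. All thresholds, weights, and success checks are hypothetical placeholders, not the values used in this work.

import numpy as np

def sparse_reward(ee_pos, cube_pos, cube_height, grasped, collided,
                  torques, torque_limits):
    """Illustrative composite sparse reward; all numbers are placeholders."""
    r = 0.0
    # Sparse bonus for bringing the end effector close to the cube.
    if np.linalg.norm(ee_pos - cube_pos) < 0.05:
        r += 1.0
    # Sparse bonus for a successful grasp.
    if grasped:
        r += 2.0
    # Sparse bonus for lifting the grasped cube off the surface.
    if grasped and cube_height > 0.10:
        r += 5.0
    # Punishment for touching an obstacle.
    if collided:
        r -= 5.0
    # Small penalties that encourage smooth control actions
    # and discourage exceeding the torque limits.
    r -= 1e-3 * float(np.sum(np.square(torques)))
    r -= 0.5 * float(np.sum(np.abs(torques) > np.abs(torque_limits)))
    return r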
For the simulation environment of the training, an open-space MuJoCo environment is set up together with a simulated Franka Emika Panda manipulator and its gripper. The task of the robot consists of reaching an object, grasping it, and lifting it off the surface without colliding with any obstacles, while keeping the transitions as smooth as possible and without reaching any of the joint limits. The generalization of the task allows us to modify parameters and the initial configuration without modifying the learning procedures. The use of multiple agents further generalizes the whole problem and pushes us toward the field of modular robotics, in which each joint of the robot can be seen as a separate, smaller robot, and the joints must work together to complete a given task.
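To make the decentralized, modular view concrete, the following is a minimal NumPy sketch (not the training code of this thesis) in which each of the eight joints, seven arm joints plus one gripper joint, is driven by its own small policy network; the observation size, hidden width, and random weights are illustrative assumptions.

import numpy as np

class JointAgent:
    """One small policy per joint: maps an observation to a single
    normalized torque command in [-1, 1]. Sizes are illustrative."""
    def __init__(self, obs_dim, hidden=32, seed=None):
        rng = np.random.default_rng(seed)
        self.w1 = rng.normal(scale=0.1, size=(obs_dim, hidden))
        self.w2 = rng.normal(scale=0.1, size=(hidden, 1))

    def act(self, obs):
        h = np.tanh(obs @ self.w1)          # small hidden layer
        return np.tanh(h @ self.w2).item()  # one torque per agent

# Decentralized controller: 7 arm joints + 1 gripper joint,
# each joint treated as a separate, smaller "robot" with its own agent.
obs_dim = 24                                # hypothetical observation size
agents = [JointAgent(obs_dim, seed=i) for i in range(8)]

obs = np.zeros(obs_dim)                     # placeholder observation
action = np.array([agent.act(obs) for agent in agents])
print(action.shape)                         # (8,) -> one command per joint

In the actual setup, each such per-joint agent would be trained with PPO or SAC rather than kept at random weights; the point of the sketch is only the structure, one small network per joint acting in parallel.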
https://hdl.handle.net/10589/180233