Application of reinforcement learning methods to autonomous driving systems in the overtaking scenario

Autonomous driving has become a focal point of transportation research, as it promises to change the very concept of mobility as we know it. Thanks to modern sensor setups, vehicles can now provide vast amounts of data describing their state in real time; nevertheless, programming an efficient controller able to handle the driving task remains an extremely complex problem. This type of control challenge is well suited to methodologies from the field of artificial intelligence. This thesis addresses the critical issues of the overtaking maneuver. Its objective is to extract human driving behavior from a set of overtaking simulations in order to train a controller capable of safely performing the same maneuver. Starting from the collected trajectories, we first reconstruct an approximation of the human behavior as a neural policy trained through Behavioral Cloning. We then apply Inverse Reinforcement Learning (IRL) to reconstruct the driver's reward function as a linear model of features. This reward represents the driver's objectives during the driving and overtaking tasks, which are universally valid, independent of the type of car (vehicle dynamics) and of the road conditions (transition model). This linear set of rules is finally used to train the final controller directly. The control system is built as a custom hybrid rule-based parametrized structure, introduced here as HRBC. Unlike more classical Reinforcement Learning (RL) implementations, it allows the user to set fixed boundaries in terms of safety constraints while leaving total freedom in how the maneuver is actually performed.
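As a point of reference, a reward function that is linear in a set of features, as described above, is commonly written in the following form. The symbols (the weight vector \omega, the feature map \phi, the number of features k) are notational assumptions made here for illustration and are not taken from the thesis; the actual features used are not specified in this abstract.

```latex
% Linear reward model: a weighted sum of k state-action features.
% \omega_i are the weights recovered by IRL; \phi_i are the features.
R_{\omega}(s, a) = \omega^{\top} \phi(s, a) = \sum_{i=1}^{k} \omega_i \, \phi_i(s, a)
```

Under this form the weights encode the driver's objectives, while the vehicle dynamics and road conditions enter only through how the state s evolves, which is consistent with the claim that the recovered reward does not depend on the transition model.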
The policy model is trained with the Policy Gradient with Parameter-based Exploration (PGPE) algorithm, a modern algorithm that achieves faster convergence with lower variance in the gradient estimate through the use of a hyperpolicy. Finally, tests are performed on new driving scenarios. The resulting controller handles overtaking scenarios both inside and outside the training conditions, and the overall training procedure proves to be a clear improvement over previous solutions in terms of training efficiency, ease of control modeling, and overall performance of the final product. All simulations are performed in IPG Carmaker, while the rest of the implementation also relies on Matlab & Simulink; part of the Reinforcement Learning algorithms is instead implemented in Python with the support of TensorFlow utilities. Author email: bruno.polli@mail.polimi.it
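To make the role of the hyperpolicy in PGPE concrete, here is a minimal sketch in plain Python. It is not the implementation used in the thesis. It assumes a Gaussian hyperpolicy with independent components, and it replaces the driving simulation with a hypothetical `toy_return` function. The mean-return baseline, the return normalisation, the learning rates and the floor on sigma are all illustrative choices.

```python
import random


def toy_return(theta):
    # Hypothetical stand-in for one simulated episode (not the thesis simulator):
    # the return is highest when the controller parameters equal [1.0, -2.0].
    target = [1.0, -2.0]
    return -sum((t - g) ** 2 for t, g in zip(theta, target))


def pgpe(episode_return, dim, iterations=300, batch=20,
         lr_mu=0.2, lr_sigma=0.1, seed=0):
    """Basic PGPE sketch with a Gaussian hyperpolicy N(mu, sigma^2)."""
    rng = random.Random(seed)
    mu = [0.0] * dim      # hyperpolicy mean, one entry per controller parameter
    sigma = [1.0] * dim   # hyperpolicy standard deviation per parameter
    for _ in range(iterations):
        # Exploration happens in parameter space: each sampled theta defines
        # a deterministic controller that is evaluated for a whole episode.
        thetas = [[rng.gauss(mu[i], sigma[i]) for i in range(dim)]
                  for _ in range(batch)]
        returns = [episode_return(theta) for theta in thetas]
        # Mean-return baseline, then normalisation to [-1, 1]
        # (the normalisation is an illustrative choice, assumed here).
        baseline = sum(returns) / batch
        scale = max(abs(r - baseline) for r in returns) + 1e-8
        adv = [(r - baseline) / scale for r in returns]
        for i in range(dim):
            diffs = [theta[i] - mu[i] for theta in thetas]
            # Log-likelihood gradients of the Gaussian, each scaled by sigma^2.
            grad_mu = sum(a * d for a, d in zip(adv, diffs)) / batch
            grad_sigma = sum(a * (d * d - sigma[i] ** 2) / sigma[i]
                             for a, d in zip(adv, diffs)) / batch
            mu[i] += lr_mu * grad_mu
            # Keep sigma strictly positive with a small floor.
            sigma[i] = max(sigma[i] + lr_sigma * grad_sigma, 1e-3)
    return mu, sigma


if __name__ == "__main__":
    mu, sigma = pgpe(toy_return, dim=2)
    print("learned mean:", mu)
    print("learned sigma:", sigma)
```

Because the noise is injected once per episode in parameter space rather than at every control step, each rollout is produced by a deterministic controller, which is the usual explanation for the lower-variance gradient estimate mentioned above.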
