Batch reinforcement learning for highway driving

An Autonomous Vehicle (AV), also known as a self-driving or driverless car, is a vehicle capable of moving safely with little or no human input. As evidence of the advances in the field of Autonomous Driving (AD), in 2014 the Society of Automotive Engineers (SAE) International defined five stages of development, from simple driving assists (level 1) to full automation (level 5), in which the autonomous system completely replaces the human driver. To achieve full automation, the vehicle must be able to react promptly, intelligently, and safely to possibly unseen situations, just as a human would. Whereas partial automation can be achieved with classic task planning frameworks, the dynamic nature of the driving task requires a level of flexibility that classic approaches cannot guarantee. For this reason, in recent years AD research has focused on exploring new approaches, mainly methods that belong to a subfield of Artificial Intelligence (AI) known as Reinforcement Learning (RL). Reinforcement learning is a branch of Machine Learning (ML) that aims at solving decision-making problems. In a typical RL setting, an autonomous agent observes and interacts with the surrounding environment in order to fulfil a specific goal. By interacting with the environment, the agent receives a reward, which depends on the current state of the environment and on the actions performed by the agent, and which usually measures how well the agent is behaving. Most goals can therefore be described as maximizing the cumulative reward over time. For example, a dog can be taught to give its paw by rewarding it every time it performs the task correctly: the dog receives the reward only if it gives its paw when requested, and over time it learns to associate the owner's extended hand with the gesture of giving its paw.
Many autonomous driving tasks can be described well as RL problems. In our specific case, we tackle the problem of driving through dense traffic on a multi-lane road; the goal is to drive as fast as possible, learning to overtake other vehicles efficiently without getting stuck in traffic. The method we propose to solve this problem is Fitted Q Iteration (FQI), which belongs to a class of RL algorithms known as Batch Reinforcement Learning algorithms. Finally, we provide an empirical evaluation of the proposed algorithm and compare its performance in a simulated environment with that of a well-known on-line method, the Deep Q Network (DQN) algorithm.
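
To make the batch setting concrete, the following is a minimal sketch of Fitted Q Iteration on a toy problem. It is an illustration under stated assumptions, not the implementation evaluated in this work: the five-state chain environment, the names `step`, `collect_batch`, `fitted_q_iteration` and `greedy_action`, the discount factor, and the tabular "regressor" (the mean target for each state-action pair) are all hypothetical choices made for brevity. A realistic driving task would need a proper function approximator in place of the table.

```python
import random

GOAL = 4        # hypothetical toy chain: states 0..4, episode ends at state 4
N_ACTIONS = 2   # action 0 = move left, action 1 = move right


def step(state, action):
    """Toy dynamics (an assumption for illustration, not a driving simulator)."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    done = next_state == GOAL
    reward = 1.0 if done else 0.0
    return reward, next_state, done


def collect_batch(n_episodes, max_steps=50, seed=0):
    """Gather transitions once, with a uniformly random behaviour policy."""
    rng = random.Random(seed)
    batch = []
    for _ in range(n_episodes):
        state, done, steps = 0, False, 0
        while not done and steps < max_steps:
            action = rng.randrange(N_ACTIONS)
            reward, next_state, done = step(state, action)
            batch.append((state, action, reward, next_state, done))
            state = next_state
            steps += 1
    return batch


def fitted_q_iteration(batch, n_actions, gamma=0.9, n_iterations=20):
    """Learn Q purely from a fixed batch; no further environment interaction."""
    q = {}  # (state, action) -> value; missing entries are treated as 0.0
    for _ in range(n_iterations):
        sums, counts = {}, {}
        for s, a, r, s_next, done in batch:
            # Regression target built from the previous Q estimate.
            best_next = 0.0 if done else max(
                q.get((s_next, b), 0.0) for b in range(n_actions)
            )
            target = r + gamma * best_next
            sums[(s, a)] = sums.get((s, a), 0.0) + target
            counts[(s, a)] = counts.get((s, a), 0) + 1
        # "Fit" step: the tabular regressor is the mean target per (s, a).
        q = {key: sums[key] / counts[key] for key in sums}
    return q


def greedy_action(q, state, n_actions):
    """Pick the action with the highest estimated value in this state."""
    return max(range(n_actions), key=lambda a: q.get((state, a), 0.0))


batch = collect_batch(n_episodes=200)
q = fitted_q_iteration(batch, N_ACTIONS)
policy = [greedy_action(q, s, N_ACTIONS) for s in range(GOAL)]
print(policy)
```

The point of the sketch is the separation of phases: `collect_batch` is called once, and `fitted_q_iteration` then repeatedly solves a regression problem over the same fixed set of transitions, in contrast with an on-line method that keeps interacting with the environment while it learns.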

An autonomous vehicle is a vehicle capable of moving safely with minimal or no human intervention. As evidence of the progress made in autonomous driving, in 2014 SAE International defined 5 stages of development, from simple driving aids (level 1) to complete vehicle autonomy (level 5), in which the autonomous system will completely replace the human driver. To sustain complete autonomy, the system must be able to react intelligently and safely to situations it may never have faced, exactly as a human being would. While partial autonomy can be achieved with classic task planning tools, the dynamic nature of autonomous driving requires a very high level of flexibility that this type of approach simply cannot guarantee. For this reason, in recent years research has concentrated its efforts on exploring new techniques, mainly methods belonging to a branch of Artificial Intelligence known as Reinforcement Learning. Reinforcement Learning is an area of machine learning whose purpose is to study and solve decision-making processes. In a typical problem, an autonomous agent observes and interacts with the surrounding environment in order to reach a predetermined objective. By interacting with the environment, the agent obtains a reward that depends on the configuration of the environment and on the actions taken, and that usually indicates the quality of those actions. Consequently, the agent's objective reduces to maximizing the sum of the rewards obtained over time.
For example, a dog can be trained to give its paw by rewarding it every time the procedure is carried out correctly: the dog will receive the reward only if it gives its paw when asked; as time passes, the dog will learn to associate the owner's outstretched hand with the gesture of giving its paw. Most activities related to autonomous driving can be described as reinforcement learning problems. In our specific case, the problem we set out to solve is autonomous driving in traffic on a multi-lane road; the objective is to drive as fast as possible, learning to overtake other vehicles without getting stuck in traffic. The method we propose to solve this problem is called Fitted Q Iteration and belongs to a class of algorithms known as Batch Reinforcement Learning. Whereas in on-line algorithms the agent learns on the spot, through trial and error, in batch methods the learning phase takes place off-line, using previously collected experience. Finally, we present an evaluation of the proposed algorithm and compare it with Deep Q Network (DQN), an on-line algorithm that has been applied successfully to various autonomous driving problems.