Analyzing the sim-to-real gap in autonomous vehicle perception

Advanced Driver Assistance Systems (ADAS) are now a common feature in new vehicles. Training such critical systems well may require a large quantity of data, and acquiring it can be expensive in both time and money. Generating synthetic data with a simulator can simplify this process, but ADAS components trained on such data may yield results different from those obtained by training on real-world data; this discrepancy is known as the sim-to-real gap. In this work, a photorealistic driving simulator is used to generate a synthetic dataset for semantic image segmentation, with the aim of measuring the sim-to-real gap of models trained on it. To do so, a deep neural network is trained on the synthetic dataset, and its performance is compared with that obtained by training the same model on Cityscapes and CamVid, two datasets acquired in the real world. The results show a performance gap between models trained on simulated data and the same models trained on real-world data, which is mainly revealed by the difficulty the former have in classifying pixels that belong to the less frequent dataset labels. On the other hand, the outputs of the models trained on the synthetic dataset show that these networks can identify the general structure of the input image and correctly classify pixels belonging to the most common labels, which suggests that more thorough work on datasets created with this driving simulator might produce better results.

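As an illustration of how such a gap could be quantified per label, the following is a minimal sketch and not the evaluation code used in this work. It assumes per-class intersection-over-union (IoU) as the metric, which the abstract does not name, and all function names and the toy label maps are hypothetical. It compares the predictions of a model trained on real data with those of the same model trained on synthetic data, both evaluated against the same real-world ground truth.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, num_classes):
    """Count pixels for every (true label, predicted label) pair."""
    y_true = np.asarray(y_true).ravel()
    y_pred = np.asarray(y_pred).ravel()
    idx = y_true * num_classes + y_pred
    counts = np.bincount(idx, minlength=num_classes * num_classes)
    return counts.reshape(num_classes, num_classes)

def per_class_iou(cm):
    """IoU per class; NaN for classes absent from both labels and predictions."""
    tp = np.diag(cm).astype(float)
    union = cm.sum(axis=0) + cm.sum(axis=1) - tp
    return np.where(union > 0, tp / np.maximum(union, 1), np.nan)

def sim_to_real_gap(y_true, pred_real, pred_sim, num_classes):
    """Per-class and mean IoU difference (real-trained minus sim-trained)."""
    iou_real = per_class_iou(confusion_matrix(y_true, pred_real, num_classes))
    iou_sim = per_class_iou(confusion_matrix(y_true, pred_sim, num_classes))
    return iou_real - iou_sim, np.nanmean(iou_real) - np.nanmean(iou_sim)

# Toy 3x4 label map (assumption): class 2 is rare, with a single pixel.
y_true = np.array([[0, 0, 0, 1],
                   [0, 0, 1, 1],
                   [0, 2, 1, 1]])
pred_real = y_true.copy()      # hypothetical real-trained model: perfect
pred_sim = y_true.copy()
pred_sim[2, 1] = 0             # hypothetical sim-trained model misses the rare class

gap_per_class, gap_mean = sim_to_real_gap(y_true, pred_real, pred_sim, 3)
print(gap_per_class, gap_mean)
```

In this toy input the rare class accounts for the largest per-class gap, which mirrors the kind of behaviour described above for the less frequent labels.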