Towards resource-aware visual-inertial SLAM

In recent years, Augmented Reality (AR) technologies have expanded significantly. AR devices are equipped with cameras and visualization components that allow them to project digital objects into the real world. To this end, it is fundamental to construct a 3D model of the surroundings and to estimate the location of one of the cameras within that model. In computer vision, this task is known as Simultaneous Localization And Mapping (SLAM). AR devices can be small headsets that perform many tasks at once, leaving limited computing power for SLAM. Therefore, one of the main objectives of this thesis is to study the trade-off between computation time and accuracy in a SLAM algorithm. After a review of the state of the art, we selected ORB-SLAM3 as one of the most accurate, robust, and complete SLAM methods. We seek the configuration that guarantees a high level of accuracy while allowing ORB-SLAM3 to run as fast as possible. We tested ORB-SLAM3 on a publicly available dataset with different sensors (monocular and stereo camera, inertial measurement unit) and by varying the number of features extracted per frame, the parameter that most affects computation time. We found the stereo-inertial setup to be the most accurate overall. Furthermore, the considerable diversity of environments makes it impossible to designate a single optimal number of features that guarantees both the best accuracy and the lowest computation time. Therefore, as the second contribution of this thesis, we propose a modification of ORB-SLAM3 that implements a resource-aware solution to this problem. It consists of a feedback loop, driven by the current computation time, that scales the complexity of Bundle Adjustment (BA), the process that jointly optimizes the points of the reconstructed environment and the trajectory of the camera. BA is scaled by limiting the number of points to be optimized, which are chosen through a score function. On our system, this modification allows ORB-SLAM3 to run in real time with a number of features that would otherwise push it just above the real-time threshold. This produces the richest 3D reconstruction the system can afford, adaptively exploiting all available resources without a considerable loss of accuracy.
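
The feedback loop described above can be illustrated with a minimal Python sketch. It is not the implementation used in the thesis: the proportional update rule, the gain, the budget limits, and the names `PointBudgetController` and `select_points` are all assumptions made for illustration, and the score values are placeholders for whatever score function ranks the map points.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class PointBudgetController:
    """Hypothetical proportional controller for the number of points given to BA."""

    target_ms: float        # assumed real-time budget for one BA run
    budget: int             # current maximum number of points to optimize
    min_budget: int = 50    # assumed lower limit, keeps BA meaningful
    max_budget: int = 5000  # assumed upper limit
    gain: float = 0.5       # assumed proportional gain

    def update(self, measured_ms: float) -> int:
        # Positive error: spare time is available, so the budget grows.
        # Negative error: the last run was too slow, so the budget shrinks.
        error = (self.target_ms - measured_ms) / self.target_ms
        scaled = self.budget * (1.0 + self.gain * error)
        self.budget = int(min(self.max_budget, max(self.min_budget, scaled)))
        return self.budget


def select_points(scores: dict[int, float], budget: int) -> list[int]:
    """Return the ids of the `budget` highest-scoring points (illustrative scores)."""
    ranked = sorted(scores, key=lambda point_id: scores[point_id], reverse=True)
    return ranked[:budget]
```

In this sketch the controller would be updated once per BA run with the measured time, and the resulting budget passed to `select_points` before the next optimization. A proportional rule is only one possible choice; any monotone mapping from timing error to budget would fit the same description.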

Augmented Reality (AR) technologies are undergoing a significant expansion. AR devices are equipped with cameras and display components that allow them to project digital objects into the real world. To this end, it is essential to build a 3D map of the surrounding environment and to estimate the position of one of the cameras within that map. In computer vision, this problem falls within the scope of Simultaneous Localization And Mapping (SLAM). AR devices can be small headsets capable of performing many tasks, leaving little computing power for SLAM. Hence, one of the main objectives of this thesis is to study the trade-off between computation time and accuracy in a SLAM algorithm. After a review of the state of the art, we selected ORB-SLAM3, one of the most accurate and robust SLAM algorithms. We want to find a configuration that guarantees a high level of accuracy and allows ORB-SLAM3 to run as fast as possible. We tested ORB-SLAM3 on a publicly available dataset with different sensors (monocular and stereo camera, inertial measurement unit or IMU) and by varying the number of features extracted per frame, the parameter that most affects computation time. The stereo-inertial configuration proved to be the most accurate overall. Moreover, the considerable diversity of environments makes it impossible to settle on an optimal value for the number of features. We therefore propose a modification of ORB-SLAM3 that implements a resource-aware solution. The modification consists of a feedback loop based on computation time that scales the complexity of Bundle Adjustment (BA), the process that optimizes the points of the 3D map and the trajectory of the camera. BA is scaled by limiting the number of points to be optimized, which are chosen through a merit measure. On our system, this modification allows ORB-SLAM3 to run in real time with a number of features that would normally be just above the real-time threshold. This yields the most detailed 3D reconstruction the system can afford, adaptively exploiting all available resources without a considerable loss of precision.