A late-cascade fusion framework for 3D object detection from point clouds and multiple images

3D object detection is crucial for understanding the environment and planning motion in autonomous driving. Autonomous vehicles are usually equipped with RGB cameras and LiDAR sensors, which provide complementary information. RGB detectors can leverage the semantic information in RGB images but provide poor localization estimates, while LiDAR detectors can take advantage of the accurate geometry of LiDAR point clouds but struggle with small objects because point clouds are sparse. In this work, we present a new way to detect 3D objects from multimodal inputs, leveraging both LiDAR and RGB cameras in a hybrid late-cascade scheme that combines an RGB detection network and a 3D LiDAR detector. We exploit late fusion principles to reduce LiDAR false positives, matching LiDAR detections with RGB ones by projecting the LiDAR bounding boxes onto the image. We rely on cascade fusion principles to recover LiDAR false negatives, leveraging frustums extracted from RGB detections that are not matched with any LiDAR detection. We mitigate the computational overhead of cascade fusion by applying it only in the small regions associated with missed LiDAR detections, while we process the whole scene with the more computationally efficient late fusion. Our solution is independent of the underlying detectors and can be plugged on top of any pair of single-modal detectors. This makes training flexible: the two branches can be trained separately, or pre-trained models can be reused. We evaluate our approach on the KITTI object detection benchmark and on the nuScenes dataset, showing significant performance improvements, especially for more challenging classes such as Pedestrians and Cyclists in KITTI and Motorcycles and Bicycles in nuScenes.

3D object detection is a fundamental problem for understanding the environment and planning motion in the field of self-driving cars. Autonomous vehicles are generally equipped with RGB cameras and LiDAR sensors, which provide complementary information. 3D object detectors based on RGB data can exploit the semantic information of RGB images but provide imprecise 3D localization estimates, while detectors based on LiDAR data can exploit the accurate geometry of LiDAR point clouds but struggle to detect small objects because of the low density of the points. In this work, we present a new method for detecting 3D objects from multimodal inputs, exploiting both LiDAR and RGB cameras in a hybrid scheme that combines late and cascade fusion principles, bringing together a 2D detection network operating on RGB images and a 3D detection network operating on LiDAR data. We use late fusion principles to reduce LiDAR false positives, matching LiDAR detections with RGB ones by projecting the LiDAR bounding boxes onto the image. We rely on cascade fusion principles to recover LiDAR false negatives, exploiting the frustums extracted from RGB detections that do not correspond to any LiDAR detection. In our solution, we reduce the computational overhead of cascade fusion by applying it only in specific small areas associated with missed LiDAR detections, while we process the whole scene with computationally more efficient late fusion approaches. Our solution is independent of the underlying networks and can be integrated on top of any pair of single-modal networks, allowing a flexible training process in which pre-trained models can be reused or the two branches can be trained separately. Our approach was evaluated on the KITTI and nuScenes datasets, showing significant performance improvements over single-modality methods, in particular for the most difficult classes such as Pedestrians and Cyclists in KITTI and Motorcycles and Bicycles in nuScenes.
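
The matching step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that each LiDAR detection is given as its 8 box corners in the camera frame, that all corners lie in front of the camera, that a 3x4 projection matrix `P` is available, and that a LiDAR detection is kept only if its projected box overlaps some RGB box with an IoU above a threshold. The function names, the best-overlap rule, the 0.5 threshold, and the choice to discard unmatched LiDAR boxes outright are all hypothetical choices made for illustration.

```python
import numpy as np


def project_box(corners_3d, P):
    """Project 8 box corners (8x3, camera frame) with a 3x4 matrix P.

    Returns the enclosing 2D box [x1, y1, x2, y2] in pixels.
    Assumes every corner has positive depth.
    """
    ones = np.ones((corners_3d.shape[0], 1))
    uvw = np.hstack([corners_3d, ones]) @ P.T   # homogeneous pixel coords
    uv = uvw[:, :2] / uvw[:, 2:3]               # perspective division
    return np.concatenate([uv.min(axis=0), uv.max(axis=0)])


def iou_2d(a, b):
    """Intersection over union of two [x1, y1, x2, y2] boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def late_cascade_match(lidar_corners, rgb_boxes, P, iou_thr=0.5):
    """Split detections into confirmed LiDAR boxes and frustum candidates.

    Returns (kept_lidar, frustum_rgb):
      kept_lidar  - indices of LiDAR detections confirmed by an RGB box
                    (late fusion: unmatched ones are treated as false positives)
      frustum_rgb - indices of RGB detections with no LiDAR match
                    (cascade fusion: candidates for frustum-based recovery)
    """
    kept_lidar, matched_rgb = [], set()
    for i, corners in enumerate(lidar_corners):
        projected = project_box(corners, P)
        ious = [iou_2d(projected, box) for box in rgb_boxes]
        if ious:
            best = int(np.argmax(ious))
            if ious[best] >= iou_thr:
                kept_lidar.append(i)
                matched_rgb.add(best)
    frustum_rgb = [j for j in range(len(rgb_boxes)) if j not in matched_rgb]
    return kept_lidar, frustum_rgb
```

Only the RGB detections returned in `frustum_rgb` would be passed to the more expensive frustum-based stage, which is how the scheme restricts cascade fusion to small regions of the scene.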