Trifocal tensor estimation for n-view deep structure-from-motion

Structure-from-Motion (SfM) is the problem of recovering the 3D structure of a scene from n images of that scene taken from different viewpoints. The 3D structure can be recovered by estimating the position of the cameras from which the images were taken (camera pose) together with a depth measurement at each pixel of the images (depth map). SfM pipelines are useful in many applications, ranging from autonomous driving to augmented reality. The number n of images used to infer a single depth map is crucial for reducing noise in the depth estimates and for handling occlusions better. Neural networks have been used throughout the SfM pipeline to improve performance and robustness. However, in the general case of n images, neural networks typically do not take advantage of well-known 3D geometric constraints. The solutions proposed to address this limitation are all restricted to the base case of n = 2 views. In this work we propose an SfM pipeline for the general n-view case that efficiently couples neural networks with 3D geometric constraints. The pipeline leverages the trifocal tensor and introduces a novel pose-chaining algorithm that extends camera pose estimation to the general case of n images. We also compare different trifocal tensor estimation algorithms and provide their implementations. We show empirically that our pipeline outperforms previous state-of-the-art SfM pipelines on the KITTI dataset and achieves promising results on ETH3D.

Structure-from-Motion (SfM) is the problem of estimating the three-dimensional structure of a scene from n images of it taken from different viewpoints. The 3D structure can be recovered by estimating the position of the cameras from which the images were taken (camera pose) and the depth at every pixel (depth map). SfM algorithms are useful in many applications, ranging from autonomous driving to augmented reality. The number n of images used to derive a single depth map is a crucial parameter that makes it possible to reduce noise in the depth estimates and to handle occlusions better. Neural networks have been used at every stage of SfM algorithms to improve their performance and robustness. However, in the general case of n images, neural networks often make limited or inefficient use of well-known three-dimensional geometric constraints. All the solutions proposed to alleviate these drawbacks are limited to the base case of n = 2 images. In this work we propose an n-image SfM algorithm that efficiently couples neural networks with the use of 3D geometric constraints. Our proposal exploits the trifocal tensor and includes a new pose-chaining algorithm that extends camera pose estimation to the general case of n images. In addition, we provide a comparison of different trifocal tensor estimation algorithms together with their implementation. Finally, we show empirically that our proposed algorithm outperforms previous state-of-the-art SfM algorithms on the KITTI dataset and obtains promising results on ETH3D.