Deep models for unsupervised and weakly-supervised optical flow estimation

In the last years, the exponential growth of multimedia data such as images and videos has pushed the development of new artificial intelligence algorithms for video understanding. In particular, Deep Learning methods have shown to be particularly suitable for this type of tasks being able to solve problems, such as image classification and image segmentation, outscoring traditional approaches. Video analysis is still one of the most challenging computer vision task, due to occlusions, irregular camera motion, variety of of moving objects, etc. At the same time, it is one of the most useful, being present in everyday’s life, in applications such as autonomous driving or video surveillance. Motivated by this, this thesis work focuses on the problem of optical flow estimation. Since no dataset exists which provides ground-truth flow annotations for real-world scenes, indeed no measurement devices exist to record such data with sufficiently high accuracy, we tackled the problem by designing an architecture based on Convolutional Neural Networks trained with an unsupervised learning paradigm to estimate the optical flow between two input frames. Then, we have experimented a different learning paradigm, which we called weakly-supervised, to build a model which predicts along with the optical flow, the semantic segmentation of the input frames, in order to study if semantic annotations, which are easier to retrieve, can have a positive impact on learning the flow estimations. We show that is possible to learn an optical flow estimation network unsupervised; however, the optical flow predictions do not significantly improve with this second architecture. Nevertheless the segmentations produced by our joint model are slightly better than the ones generated by the segmentation network alone.

Negli ultimi anni, la crescita esponenziale di dati quali immagini e video, ha dato nuova linfa allo sviluppo di algoritmi di intelligenza artificiale per l'analisi video. In particolare, quelli di Deep Learning hanno dimostrato di essere notevolmente adatti a trattare questo tipo di problem, e di essere in grado di affrontarne alcuni quali la classificazione e la segmentazione di immagini, superando i risultati degli approcci tradizionali. L'analisi video rimane comunque una delle sfide più complesse, a causa delle occlusioni, dei movimenti irregolari della camera, della varietà di oggetti in movimento, etc. Allo stesso tempo è una delle più utili, essendo presente nella vita di tutti i giorni in applicazioni quali la guida autonoma o la sorveglianza video. Motivati da queste considerazioni, questa tesi si concentra sul problema della stima del flusso ottico. Dal momento che non esiste un dataset che fornisca i flussi ottici di immagini reali, poiché non ci sono apparecchi in grado di misurarlo con precisione, abbiamo affrontato il problema costruendo un modello basato sulle Reti Neurali Convoluzionali che, in maniera non supervisionata, stimasse il flusso ottico tra due frame di input. In seguito, abbiamo sperimentato un diverso approccio, che abbiamo chiamato debolmente supervisionato, costruendo un modello che predicesse, insieme al flusso ottico, la segmentazione semantica dei due frame, analizzando se la guida semantica, che è più facile da ottenere, usata durante la fase di apprendimento, potesse avere un impatto positivo sulle stime del flusso. Abbiamo dimostrato che è possibile addestrare una rete in modo non supervisionato affinché riesca a predire il flusso ottico. Tuttavia, le predizioni del flusso non migliorano in modo significativo con questa seconda architettura, ma che le segmentazioni prodotte da questo modello sono leggermente migliori di quelle generate dalla rete di segmentazione da sola.