Pixelwise semantic segmentation of urban scenes using recurrent deep neural networks

Semantic Segmentation is a low level structured prediction task that aims to correctly infer the semantic label of each pixel in an image. Thanks to the breakthrough of Deep Learning, Semantic Segmentation gained a significant interests in particular applied to Autonomous Driving. Almost the totality of the state-of-the-art Semantic Segmentation architecture uses Convolutional Neural Networks (CNNs). Due to the limited size of the local receptive fields, CNNs can usually fail to capture spatially long-range dependence across different local areas of the image. Although these networks have showed their excellent performance for Image Recognition and Classification tasks [1], to correctly modeling the contextual long-term dependencies between distant pixels is crucial for Semantic Segmentation. Recurrent Neural Networks (RNNs) have proved their ability to modeling long-term dependences in several sequential prediction tasks such as speech recognition and language understanding. In this work we focused on ReSeg, a fully recurrent architecture proposed by Visin et Al. [2] for object segmentation. Each recurrent layer consists of two bidirectional RNNs that scan the image vertically and horizontally extracting both local and global features. The recurrent topology of the network allows to learn long term spatial correlation and dependencies between pixels in the image in the context of the entire image. We performed a greedy procedure for hyper-parameter optimization, then, we extensively tested the network on challenging urban scene parsing datasets such as Camvid [3] and Cityscapes [4] showing comparable results to the state-of-the-art convolutional models. We further extended the model combining both convolutional and recurrent layers, showing that the convolutional-based models for Semenatic Segmentation actually benefit from the use of ReSeg layers.

La Segmentazione Semantica (Semantic Segmentation) è un tipo di classificazione di basso livello che ha come obiettivo quello di predire la categoria semantica a cui appartiene ogni pixel all’interno di un’immagine. Negli ultimi anni, grazie ai numerosi sviluppi fatti nell’ambito del Deep Learning, l’interesse nei confronti del problema di Semantic Segmentation ha subito un forte incremento, soprattutto nel contesto della guida autonoma. La maggior parte dei sistemi di Semantic Segmentation è basato su Reti Neurali Convolutive (CNNs). Questo tipo di modelli ha dimostrato di raggiungere performace eccellenti nel problema di riconoscimento di immagini, ma non sono sempre in grado di apprendere correttamente le dipendenze tra aree dell’immagine distanti fra loro. Questo tipo di dipendenze è cruciale al fine di ottenere una buona segmentazione. Le Reti Neurali Ricorrenti (RNNs) hanno dimostrato di essere in grado di modellare le dipendenze a lungo termine in numerosi problemi di predizione sequenziale, quali riconoscimento vocale, comprensione del linguaggio naturale o traduzione automatica. Il nostro lavoro è basato sul modello ricorrente ReSeg proposto da Visin et Al. [2] per la segmentazione di oggetti. Ogni layer ricorrente è formato da due RNNs bidirezionali che scorrono l’immagine verticalmente e orizzontalmente al fine di estrarre features sia locali che globali. Grazie alla sua topologia, la rete è in grado di apprendere le dipendenze spaziali tra i pixel nel contesto dell’intera immagine. In questo lavoro di tesi abbiamo utilizzato una procedura greedy al fine di ottimizzare gli iper-parametri della rete e avere un primo grado di sensibilità sulle capacità del modello. Successivamente abbiamo testato le performance della rete su dataset di segmentazione semantica in ambito stradale come Camvid [3] e Cityscapes [4], ottenendo performance di stato dell’arte confrontabili con quelle dei ben più usati modelli convolutivi. Abbiamo inoltre esteso il modello combinando insieme livelli convolutivi e ricorrenti, dimostrando che i modelli convolutivi possono beneficiare dell’introduzione di livelli ricorrenti incrementando così le performance di segmentazione.