Structured State Space Models for panoptic segmentation: a vision Mamba approach

Autonomous driving has become one of the main research fields for Computer Vision and Artificial Intelligence (AI); semi-autonomous and fully autonomous vehicles rely on the accuracy and efficiency of algorithms built with Machine Learning and Deep Learning approaches, which are constantly being improved in terms of performance, memory consumption and training time. One of the most important of these tasks is panoptic segmentation (cf. Section 1.1.2 for details), a task introduced by Kirillov et al. [19] that unifies instance and semantic segmentation (cf. Section 1.1.1). The goal of panoptic segmentation is both to assign a class label to each pixel (semantic segmentation) and to distinguish the individual object instances within an image (e.g. different cars, people, etc., as addressed by instance segmentation). Earlier works targeting panoptic segmentation were based on Convolutional Neural Networks (Section 1.1.3) because of their efficiency in image and video analysis; newer approaches have instead exploited the attention mechanism of transformers (Section 1.1.4), leading to clear advances. Very recently, Mamba [12], a new architecture that mimics the self-attention mechanism by means of Structured State Space Models, has been improved and adapted for vision tasks (Section 1.1.5), showing results comparable to transformer-based solutions while requiring less memory and time. Its use for directly producing semantic or instance predictions has, however, not yet been significantly explored. Furthermore, to date its efficacy has not been tested on the more complex task of panoptic segmentation. This work aims to close this research gap by exploring different modalities and architectures that use vision Mamba approaches to achieve panoptic segmentation.

Autonomous driving has become one of the main research fields of Computer Vision and Artificial Intelligence (AI); semi-autonomous and fully autonomous vehicles rely on the accuracy and efficiency of algorithms built with Machine Learning and Deep Learning approaches, which are constantly improved in terms of performance, memory consumption and training time. One of the most important tasks in this area is panoptic segmentation (cf. Section 1.1.2), a challenge introduced by Kirillov et al. [19] to unify the tasks of semantic segmentation and instance segmentation (cf. Section 1.1.1). Its goal is to assign a class label to each pixel (semantic segmentation) and to distinguish the different object instances present (for example cars, people, etc.). Early works on this topic were based on Convolutional Neural Networks (CNNs) (Section 1.1.3) thanks to their efficiency in image and video analysis; more recent approaches have instead exploited the attention mechanism of Transformers (Section 1.1.4), leading to notable progress. Recently, a new architecture called Mamba [12], which mimics the self-attention mechanism through Structured State Space Models, has been adapted to vision tasks (Section 1.1.5), showing results comparable to Transformers but with lower memory and time requirements. However, its direct use for semantic or instance predictions has not yet been explored, nor has its efficacy for the more complex task of panoptic segmentation. This work aims to fill this gap by exploring different modalities and architectures based on Vision Mamba approaches for panoptic segmentation.
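
As an illustration of the panoptic output described above (one class label per pixel, plus an instance identity for countable objects), the following minimal Python sketch merges a semantic map and an instance map into a single panoptic map. The toy classes `ROAD` and `CAR`, the `THING_CLASSES` set, the `LABEL_DIVISOR` encoding and the `to_panoptic` helper are all assumptions made for this example; they are not taken from this work or from any specific dataset format.

```python
import numpy as np

# Hypothetical toy label set: "thing" classes are countable, "stuff" is not.
ROAD, CAR = 0, 1
THING_CLASSES = {CAR}
# Assumed encoding: panoptic_id = class_id * LABEL_DIVISOR + instance_id.
LABEL_DIVISOR = 1000


def to_panoptic(semantic, instance):
    """Merge per-pixel class ids and instance ids into one panoptic map."""
    semantic = np.asarray(semantic, dtype=np.int64)
    instance = np.asarray(instance, dtype=np.int64)
    is_thing = np.isin(semantic, list(THING_CLASSES))
    # Stuff pixels carry no instance identity, so their instance id is zeroed.
    return semantic * LABEL_DIVISOR + np.where(is_thing, instance, 0)


# A 2x3 toy image: one column of road and two distinct cars.
semantic = np.array([[ROAD, CAR, CAR],
                     [ROAD, CAR, CAR]])
instance = np.array([[0, 1, 2],
                     [0, 1, 2]])

panoptic = to_panoptic(semantic, instance)
print(panoptic)
```

In this sketch the two cars share the class `CAR` but receive distinct panoptic ids, while all road pixels share a single id, which is the distinction between instance-level and purely semantic labelling that panoptic segmentation combines.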