Augmented Super-Resolution: a novel framework for semantic segmentation using Test Time Augmentation

State-of-the-art techniques for semantic image segmentation, i.e. the task of pixel-level labelling, are mostly deep convolutional models. In this setting, models based on the popular encoder-decoder paradigm are commonly used. In this work we focus on this kind of model, and specifically on the decoder. In encoder-decoder models, the encoder is typically a complex deep backbone network that extracts informative but low-resolution features from the input image. The decoder then brings these features back to the original input size to produce the final segmentation mask. Unlike encoders, however, decoders are usually designed to be fast in order to save computation time. For this reason, decoders typically employ simple upsampling strategies, such as bilinear upsampling, which ultimately degrade the final accuracy (especially for high upsampling factors). In this work we propose a way to improve segmentation performance by replacing common upsampling layers with a more advanced scaling technique based on multi-frame super-resolution. The framework, called Augmented Super-Resolution, makes use of Test Time Augmentation to obtain multiple low-resolution augmented feature maps from the network. The additional information provided by the augmentation is leveraged by the super-resolution procedure, which combines those low-resolution features to reconstruct a better high-resolution output segmentation mask. We studied the effectiveness of our method on a model that already has state-of-the-art segmentation performance, namely DeepLabV3+. With respect to the standard model, we achieved an average single-class IoU improvement of 0.9% over the full PASCAL VOC validation set. We believe that the added computational cost is justified by this gain in performance.
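
The idea of fusing several augmented low-resolution maps can be illustrated with a toy sketch. Everything in it is an assumption made for illustration only: the abstract does not specify which augmentations or which super-resolution algorithm are used, so the sketch takes integer translations as the augmentation, a classic shift-and-add scheme as the fusion step, and a hypothetical `low_res_features` function (plain strided sampling) in place of the network. In this idealized setting the fusion recovers the input exactly, which should not be expected from real network features.

```python
import numpy as np


def low_res_features(image, factor):
    """Hypothetical stand-in for the network: keep one sample every `factor` pixels."""
    return image[::factor, ::factor]


def naive_upsample(low_res, factor):
    """Baseline: nearest-neighbour upsampling of a single low-resolution map."""
    return np.repeat(np.repeat(low_res, factor, axis=0), factor, axis=1)


def shift_and_add_fusion(image, factor):
    """Fuse low-resolution maps obtained from translated copies of the input.

    Assumes both image dimensions are divisible by `factor`.
    """
    h, w = image.shape
    assert h % factor == 0 and w % factor == 0
    fused = np.zeros((h, w), dtype=float)
    counts = np.zeros((h, w), dtype=float)
    for dy in range(factor):
        for dx in range(factor):
            # Augmentation (assumed): translate the input by (dy, dx) pixels.
            shifted = np.roll(image, shift=(-dy, -dx), axis=(0, 1))
            low_res = low_res_features(shifted, factor)
            # Registration: low_res[i, j] is the sample at (i*factor+dy, j*factor+dx)
            # of the original grid, so it is written back at that position.
            fused[dy::factor, dx::factor] += low_res
            counts[dy::factor, dx::factor] += 1
    return fused / counts


rng = np.random.default_rng(0)
image = rng.random((8, 12))
fused = shift_and_add_fusion(image, factor=4)
baseline = naive_upsample(low_res_features(image, 4), 4)
print("mean abs error, single-map upsampling:", np.abs(baseline - image).mean())
print("mean abs error, multi-map fusion:     ", np.abs(fused - image).mean())
```

Each translated copy exposes a different set of samples of the underlying high-resolution grid, which is the additional information that a single upsampled map lacks. The cost is one forward pass per augmented copy, here `factor * factor` passes.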
