2D High-Resolution and Sparse Images Classification Comparing Dense and Sparse Models

Image classification is a well-known challenge in deep learning and computer vision. Over the years a large number of different networks and algorithms have been developed in this area. The majority of these techniques perform well on dense and low-resolution images and they are typically trained in fine-tuning pre-trained settings. On the other hand, these models poorly perform or even cannot be trained on high-resolution and sparse images due to the high demand for computational and memory resources. The implementation of different layers - specifically, sparse convolutional layers - is needed to manage these conditions. The major goal of this thesis is to implement efficient and effective architectures based on sparse convolutional layers by leveraging the main intuition behind four of the most known architectures in literature, to classify 2d high-resolution and sparse images. In particular, I design a pair of networks for each well-known architectures style - VGG, ResNet, EfficientNet and InceptionNet. Each pairs consists of two models: one implemented with standard convolutional layers and the other with sparse convolutional layers. In each pair, I choose to handle the two models in the closest possible way in order to have a fair comparison: same number of layers, similar number of parameters and they are also trained and tested over the same training and test sets. Finally, I compared them with the relative well-known architectures: VGG16, ResNet34, EfficientNetB0, InceptionNetV1. In my experiments, I handle Assamese Handwritten Characters dataset, captured by an high-resolution tablet. They show that processing high-resolution and sparse images by sparse convolutional layers yields more accurate classification performance than standard convolutional layers for each custom designed model. Moreover, the sparse models perform better than most of well-known fine-tuning pre-trained architectures. Therefore, the sparse models efficiently and effectively handle the sparse and high-resolution images in full-resolution settings without occurring in memory saturation and computational issues. Instead, the dense models handle only low-resolution images, the resolution reduction process results in a loss of relevant information.

La classificazione delle immagini è una sfida nota nell'ambito del deep learning e del computer vision. Nel corso degli anni sono stati sviluppati un gran numero di reti e algoritmi diversi in questi settori. La maggior parte di queste tecniche ha ottime prestazioni su immagini dense e a bassa risoluzione e di solito vengono addestrate in modalità pre-addestrata facendo fine-tuning. D'altra parte, questi modelli hanno scarse prestazioni o addirittura non possono essere addestrate su immagini ad alta risoluzione e sparse, a causa dell'elevata richiesta di risorse computazionali e di memoria. L'implementazione di diversi layers, in particolare layers convoluzionali sparsi, è necessaria per gestire queste condizioni. L'obiettivo principale di questa tesi è quello di implementare architetture efficienti ed efficaci basate su layers convoluzionali sparsi, sfruttando l'intuizione principale alla base di quattro delle architetture più note in letteratura, per classificare immagini 2d ad alta risoluzione e sparse. In particolare, progetto una coppia di reti per ogni stile di architettura noto - VGG, ResNet, EfficientNet e InceptionNet. Ogni coppia è composta da due modelli: uno implementato con layers convoluzionali standard e l'altro con layers convoluzionali sparsi. Ho scelto di gestire i due modelli - in ogni coppia - nel modo più simile possibile per avere un confronto equo: stesso numero di layers, stesso numero di parametri e i modelli sono stati addestrati e testati sugli stessi training and test set. Infine, li ho confrontati con le relative architetture più note: VGG16, ResNet34, EfficientNetB0, InceptionNetV1. Nei miei esperimenti, ho trattato il dataset Assamese Handwritten Characters, acquisito da un tablet ad alta risoluzione. I risultati dimostrano che l'elaborazione di immagini ad alta risoluzione e sparse da parte di layers convoluzionali sparsi mostrano classificazioni più accurate rispetto ai layers convoluzionali standard per ogni modello considerato. Inoltre, i modelli sparsi forniscono prestazioni migliori rispetto alla maggior parte delle note architetture pre-addestrate facendo fine-tuning. Pertanto, i modelli sparsi gestiscono in modo efficiente ed efficace le immagini sparse e ad alta risoluzione in configurazione a risoluzione completa, senza incorrere nella saturazione della memoria e in problemi computazionali. I modelli densi, invece, processano solo immagini a bassa risoluzione. Le tecniche di riduzione della risoluzione portano ad una perdita di informazioni rilevanti.