Multi-view learning through different levels of abstraction extracted by deep neural networks

This thesis combines two approaches that have become increasingly important: deep learning and multi-view learning. Since the 2006 work of Geoffrey Hinton and Ruslan Salakhutdinov, Deep Learning (DL) has grown in importance to the point of becoming a key component of systems in many disciplines, most notably Automatic Speech Recognition (ASR) and Visual Recognition (VR). Different network architectures have been used to reach the state of the art; for instance, Convolutional Neural Networks (CNNs) are the best-performing architecture for VR tasks. At the same time, data and information for a given task can now be gathered from several different sources. The techniques that learn from such sources jointly are called Multi-View Learning (MVL). Their goal is to improve performance by exploiting the agreement and the complementary information between the sources. However, collecting data from different sources is sometimes too expensive, or the sources are simply not available: there may not be enough information, or only one view may be accessible. To address this problem, various techniques have been developed to extract multiple views from a single view. This research combines DL with MVL techniques based on the following intuition. Usually, a Deep Neural Network (DNN) uses only the features of its last layer to make a prediction, ignoring all the features that could be extracted from the preceding hidden layers. The underlying idea of this thesis is to treat each layer as a different view of the original object and to combine these views to improve on the performance of the classic DNN.

The aim of this thesis is to combine two new approaches that are becoming increasingly important today: deep learning and multi-view learning. Following the 2006 work of Geoffrey Hinton and Ruslan Salakhutdinov, Deep Learning (DL) has become steadily more important, to the point of being a key part of various systems across different disciplines. The main fields are Automatic Speech Recognition (ASR) and Visual Recognition (VR). Different types of networks have been used, and some of them have reached the state of the art. For example, Convolutional Neural Networks (CNNs) are the best-performing architectures for image recognition tasks. On the other hand, several sources are now available from which to draw data and information for the predictions of interest. This gave rise to Multi-View Learning (MVL), which aims to increase performance and improve results by exploiting the concordant and complementary information across the different sources. However, collecting data from different sources is sometimes too expensive, or that many views are simply not available. There may not be enough information, or only a single source may be accessible. To avoid this problem, various techniques have been used to extract multiple views from a single view. This research combines DL with MVL techniques on the basis of the following intuition. Usually, in a Deep Neural Network (DNN) only the features of the last layer are used to make the prediction, ignoring all the features extracted by the preceding hidden layers. The underlying idea of this thesis is to treat each layer of the network as a different view of the original object and to combine them with the aim of improving the performance of classic DNNs.
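
The intuition of treating each layer as a view can be illustrated with a minimal sketch. It is not the method developed in the thesis: the layer sizes, the random weights, the tanh activation, and the helper names (make_layer, forward_views, combine_views) are all assumptions chosen for illustration, and plain concatenation stands in for whatever combination strategy is actually used.

```python
import math
import random


def make_layer(n_in, n_out, rng):
    """Random weights (n_out rows of n_in values); a stand-in for trained weights."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(n_in)] for _ in range(n_out)]


def forward_views(x, layers):
    """Run x through every layer and keep each layer's activation as one view."""
    views = []
    h = x
    for weights in layers:
        # Fully connected layer without bias, followed by tanh (assumed activation).
        h = [math.tanh(sum(w * v for w, v in zip(row, h))) for row in weights]
        views.append(h)
    return views


def combine_views(views):
    """Assumed combination: concatenate all views into a single feature vector."""
    return [value for view in views for value in view]


rng = random.Random(0)
sizes = [4, 5, 3, 2]  # hypothetical: 4 inputs, then layers of 5, 3 and 2 units
layers = [make_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]

x = [0.5, -1.0, 0.25, 2.0]          # a single made-up input example
views = forward_views(x, layers)    # one view per layer
last_only = views[-1]               # what a classic DNN passes to its predictor
multi_view = combine_views(views)   # features from all layers together

print(len(last_only), len(multi_view))
```

Here a classifier trained on last_only would see 2 features, while one trained on multi_view would see 10, with the last-layer features included at the end. Whether the extra views actually help depends on the combination strategy and on the task.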