An L0-norm sparsification approach for multilayer perceptron training

In a number of applications, Feedforward Neural Networks have been found to perform well in terms of both training and testing accuracy. Concerning the network complexity, hence the number of parameters, the fundamental theory of generalization favours simplicity: models with fewer parameters can be expected to perform better on testing data. Too many parameters make the network too sensitive to the 'noise' of the training data and thus unable to generalize well to unseen data. In this thesis, starting from the class of learning algorithm based on decomposition techniques presented by L. Grippo, A. Manno and M. Sciandrone in 2016, we propose a two-phase training algorithm for Multilayer Peceptrons with a single hidden layer, which takes into account both network accuracy and network sparsity. In the first phase, to the traditional mean square error on the training set, we add a convex approximation of the L0-norm of the weights of the network. By minimizing the combined objective function, we try to drive to zero as many weights as possible, while keeping the value of the error low and, consequently, the network performance high. In the second phase the weights with sufficiently small values are removed and the resulting network is re-trained in order to try to increase both the training and testing accuracy. In particular, we investigated separately the problem of the sparsification of the input weights, which generalizes the concept of feature selection, and that of the sparsification of the output weights, which instead can be seen as an effective way to prune the network. The proposed algorithms have been tested on 10 well-known datasets from the literature and the results obtained have been compared with the ones achieved by the original algorithm. Both the algorithms allow us to remove a significant percentage of weights and units from the network, while keeping virtually unchanged the training and testing accuracy.

In varie applicazioni, le Reti Neurali 'Feedforward' hanno fornito degli ottimi risultati sia in termini di accuratezza sull’insieme di addestramento che sui dati di test. Per quanto riguarda la complessità della rete, e quindi il numero di parametri, la teoria della generalizzazione favorisce la semplicità: ci si può aspettare che modelli con un minor numero di parametri si comportino meglio su dati non ancora osservati. Troppi parametri rendono la rete troppo sensibile al 'rumore' presente nei dati dell’insieme di addestramento, quindi incapace di generalizzare correttamente. In questa tesi, partendo dalla classe di algoritmi di addestramento basati su metodi di decomposizione presentati da L. Grippo, A. Manno e M. Sciandrone nel 2016, proponiamo un algoritmo di addestramento suddiviso in due fasi che tiene conto contemporaneamente di quanto è sparsa la rete e delle sue prestazioni. Nella prima fase, alla classica funzione obiettivo costituita dall'errore quadratico medio sull’insieme di addestramento, viene aggiunto un termine costituito da un'approssimazione convessa della norma L0 dei pesi della rete. In questo modo, minimizzando la funzione obiettivo, si cerca di portare a zero il maggior numero di pesi possibile, mantenendo allo stesso tempo basso l'errore e quindi, di conseguenza, alte le prestazioni della rete. Nella seconda fase, vengono rimossi tutti i pesi che assumono un valore sufficientemente basso e la rete che ne risulta viene riaddestrata in modo da cercare di migliorarne le prestazioni sia sull’insieme di addestramento che su quello di test. Nello specifico, sono affrontati separatamente il problema della sparsificazione dei pesi di input, che generalizza il concetto di selezione delle caratteristiche, e quello della sparsificazione dei pesi di output, che invece può essere visto come metodo efficace per eliminare unità dalla rete. Gli algoritmi proposti sono stati testati su 10 noti dataset presi dalla letteratura e i risultati ottenuti sono stati confrontati con quelli conseguiti utilizzando l’algoritmo di partenza. Entrambi gli algoritmi ci consentono di rimuovere una percentuale significativa di pesi e unità dalla rete, mantenendo al tempo stesso quasi inalterata l’accuratezza sugli insiemi di addestramento e di test.