Studying the performance of deep networks training for speech recognition applications on GPGPUs

Deep Learning (DL) techniques affect many fields, including image analysis, health care, autonomous driving, and natural language processing. Interest in DL has grown over time and has stimulated the development of various frameworks that provide high-level APIs for designing, implementing, and training DL networks. In DL, models are learned during training from huge labeled datasets, so this phase requires high computational capabilities. Graphics Processing Units (GPUs) can speed up DL workloads. For this reason, frameworks including TensorFlow, PyTorch, and Keras support execution on GPU architectures. Cloud providers offer Virtual Machine (VM) instances tailored to DL applications, with different types and numbers of graphics cards. Moreover, voice-enabled services are becoming popular. Companies like Amazon, Apple, and Google have developed voice assistants that exploit Speech Recognition and Natural Language Processing to interact with users. Voice-enabled devices can be found in home appliances and vehicles and are destined to become part of our lives. The focus of this thesis is to study the training performance of a state-of-the-art Speech Recognition system, Deep Speech, using the Mozilla open source implementation. The approach uses Machine Learning (ML) techniques, including linear regression and gradient boosting ensembles, that exploit features describing the network parameters and hardware information. The chosen models are trained on data from past experimental executions. The resulting models are evaluated in hold-out and extrapolation scenarios. To automate the processes and train the models, some libraries have been extended and new ones implemented. Using Linear Regression (LR) with Sequential Floating Forward Selection (SFFS) and the eXtreme Gradient Boosting (XGB) regressor, results show a Mean Absolute Percentage Error (MAPE) lower than 14% for extrapolation on the number or type of GPUs. On the whole dataset, the MAPE is lower than 13.1% for extrapolation on the batch size and lower than 2.2% for extrapolation on the number of iterations.
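As a rough illustration of how an extrapolation scenario and the MAPE metric fit together, the following minimal sketch fits a one-feature least-squares line on runs with few iterations and scores it on runs with more iterations. Everything in it is an assumption made for illustration: the data are synthetic, the helper names `mape` and `fit_line` are hypothetical, and the single-feature linear model is a stand-in for the LR and XGB models used in the thesis, not a reproduction of them.

```python
# Illustrative sketch only: synthetic data and hypothetical helper names.

def mape(actual, predicted):
    """Mean absolute percentage error, expressed in percent."""
    total = sum(abs((a - p) / a) for a, p in zip(actual, predicted))
    return 100.0 * total / len(actual)

def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Synthetic (iterations, training time in seconds) pairs; not measured data.
runs = [(100, 52.0), (200, 101.0), (300, 153.0),
        (400, 198.0), (800, 405.0), (1600, 795.0)]

# Extrapolation split: train on small iteration counts, test on larger ones.
train = [(x, y) for x, y in runs if x <= 400]
test = [(x, y) for x, y in runs if x > 400]

slope, intercept = fit_line([x for x, _ in train], [y for _, y in train])
predictions = [slope * x + intercept for x, _ in test]
extrapolation_mape = mape([y for _, y in test], predictions)
print(f"extrapolation MAPE: {extrapolation_mape:.2f}%")
```

A hold-out scenario would differ only in the split: test points would be drawn from the same range of iteration counts as the training points rather than from beyond it.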

Deep Learning (DL) techniques have an impact on several fields, including image analysis, health care, autonomous driving, and natural language processing. Interest in DL techniques has grown and has stimulated the development of several frameworks that provide high-level APIs for designing, implementing, and training DL networks. In DL, models learn during training from huge amounts of labeled data, so this phase requires high computational capabilities. Graphics Processing Units (GPUs) can accelerate DL workloads. For this reason, frameworks such as TensorFlow, PyTorch, and Keras support execution on GPU architectures. Cloud providers offer virtual machines specific to DL applications, with different types and numbers of graphics cards. In addition, voice services are becoming very popular. Companies such as Amazon, Apple, and Google have developed voice assistants that exploit speech recognition and natural language processing to interact with users. Voice-enabled devices can be found in home appliances and vehicles and are destined to become part of our lives. The goal of this thesis is to study the training performance of a state-of-the-art speech recognition system, Deep Speech, using an open source implementation by Mozilla. The approach uses Machine Learning (ML) techniques, such as linear regression and gradient boosting, exploiting features that describe the network parameters and hardware-related information. The chosen models are trained on data from experimental executions carried out as part of this thesis. The resulting models are then evaluated in hold-out and extrapolation scenarios. To automate the processes and train the models, some libraries have been extended and new ones implemented. Using Linear Regression (LR) with Sequential Floating Forward Selection (SFFS) and the eXtreme Gradient Boosting (XGB) regressor, the results show a Mean Absolute Percentage Error (MAPE) lower than 14% for extrapolations on the number and type of GPUs. On the complete dataset, the MAPE is lower than 13.1% for extrapolations on the batch size and lower than 2.2% for extrapolations on the number of iterations.