Job scheduling and optimal capacity allocation problems for deep learning training jobs
FILIPPINI, FEDERICA
2019/2020
Abstract
The Deep Learning (DL) paradigm has gained remarkable popularity in recent years. DL models are often used to tackle complex problems in fields such as image recognition and healthcare; however, training such models requires very large computational power. The recent adoption of GPUs as general-purpose parallel processors has partially fulfilled this need, but the high costs of this technology, even in the Cloud, make it necessary to devise efficient capacity planning and job scheduling algorithms that reduce operational costs through resource sharing. The proposed work addresses the capacity planning and job scheduling problems jointly, considering both the Cloud end-users' and the Cloud Providers' perspectives. Mathematical optimization formulations and alternative heuristic methods have been developed to provide efficient and scalable solutions. An extensive experimental campaign proves the feasibility of the proposed approaches in practical scenarios, and shows significant cost savings with respect to first-principle methods based on, e.g., first-in-first-out or earliest deadline first, achieving cost reductions between 50 and 80% on average.
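The baseline policies the abstract compares against can be illustrated with a minimal sketch. The job fields and the two ordering functions below are illustrative assumptions for exposition, not the formulations developed in the thesis itself:

```python
# Illustrative sketch (not the thesis's formulation): FIFO vs. EDF job ordering.
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class Job:
    deadline: float                       # time by which the job should complete
    submit_time: float = field(compare=False, default=0.0)
    name: str = field(compare=False, default="")

def fifo_order(jobs):
    """First-In-First-Out: serve jobs by ascending submission time."""
    return [j.name for j in sorted(jobs, key=lambda j: j.submit_time)]

def edf_order(jobs):
    """Earliest Deadline First: serve jobs by ascending deadline."""
    heap = list(jobs)                     # Job compares on deadline only
    heapq.heapify(heap)
    return [heapq.heappop(heap).name for _ in range(len(heap))]

jobs = [
    Job(deadline=9.0, submit_time=0.0, name="train-resnet"),
    Job(deadline=3.0, submit_time=1.0, name="train-bert"),
    Job(deadline=6.0, submit_time=2.0, name="train-gan"),
]
print(fifo_order(jobs))  # arrival order, deadlines ignored
print(edf_order(jobs))   # tightest deadline served first
```

Such rule-based orderings ignore job costs and resource heterogeneity, which is why the joint optimization approaches proposed in the work can outperform them.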
| File | Description | Size | Format |
|---|---|---|---|
| 2020_04_Filippini.pdf (not accessible) | Thesis text | 3.53 MB | Adobe PDF |
Documents in POLITesi are protected by copyright and all rights are reserved, unless otherwise indicated.
https://hdl.handle.net/10589/165216