TensorFlow microservices with quality of service guarantees

Cloud computing and big data have made machine learning and deep learning easier to use. These techniques build models and extract knowledge from data, and they are applied in many fields for tasks such as voice, image, and pattern recognition. Machine learning consists of two phases: learning and inference. During the learning phase, which is the more complex of the two, a model is created from the data; several frameworks support this phase and are ready for production use. During the inference phase, the model is evaluated on new data to produce a prediction. This phase appears to be less researched: most frameworks, including TensorFlow, support distributed learning almost out of the box, but although inference is simpler, deploying a model that can scale over a cluster is complex or not feasible.

This thesis has two goals. The first is to create a system that uses a cluster of nodes with different types of hardware, CPUs and GPUs, to deploy models for inference. The second is to let the system satisfy a service level agreement (SLA) during the inference phase through fine-grained allocation of CPU resources, so that each model receives just the amount of resources it needs. In this context, the agreement is stipulated on the average response time per application.

The proposed solution exploits two key enablers: containers and control theory. Containers are standard units of software that package code together with all its dependencies. They are an abstraction at the application layer: multiple containers can run on the same machine and share the OS kernel, each running as an isolated process in user space. Containers take up less space than VMs, boot faster, and allow precise and fast resource allocation. Control theory has received considerable attention because it provides a general methodology for creating adaptive systems. Self-adaptive software can allocate the amount of resources it needs to meet an SLA, and control theory can be used to implement self-adaptive applications that allocate the right amount of resources while respecting a deadline or meeting application-level quality of service (QoS) goals. In this work we use TensorFlow, one of the most popular machine learning frameworks available today.

The advent of cloud computing and the use of big data have made machine learning and deep learning easier and simpler to use. These techniques make it possible to create models and extract knowledge from data. They are used in a wide range of application fields, such as voice, image, and pattern recognition. Machine learning consists of two phases: the learning phase and the inference phase. During the first, more complex phase, the model is created from the data. Several frameworks that have reached a certain maturity are available for this phase. In the inference phase, instead, the model is evaluated on new data to obtain a prediction. This phase has been studied and researched less. In fact, most frameworks, including TensorFlow, allow learning on a cluster of nodes almost automatically. Although the inference phase is simpler, deploying a model that can scale over a cluster is complex or not feasible.

The first goal of this thesis is to create a system that can use a cluster of nodes with different types of hardware, CPUs and GPUs, to deploy multiple models for inference. The second goal is to satisfy a service level agreement (SLA) during the inference phase by using fine-grained allocation of CPU resources. This makes it possible to allocate the right amount of resources needed by the model. In this context, the agreement is stipulated on the average response time per application.

To reach these goals, the system exploits two enabling technologies: containers and control theory. Containers are used at the software level and package an application, with its code and all its dependencies, into a single unit. They are an abstraction at the application level. Multiple containers can run on the same machine and share the operating system kernel with other containers, each running as an isolated process in user space. Containers take up less space than virtual machines, are faster to start, and allow precise and fast resource allocation. Control theory has received considerable attention because it is a general methodology for creating adaptive systems. Self-adaptive software is able to allocate the right amount of resources needed to satisfy an SLA. Control theory can be used to implement self-adaptive software applications that allocate the right amount of resources while respecting a deadline or meeting per-application quality of service (QoS) goals. In this thesis we use TensorFlow, one of the most popular machine learning frameworks available today.
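
The idea of using control theory to allocate CPU so that an average response time target is met can be illustrated with a small feedback loop. The sketch below is only an illustration under stated assumptions, not the controller developed in the thesis: the `PIController` class, its gains, the core limits, and the toy model in `simulate` (response time equal to a fixed amount of work divided by the allocated cores) are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PIController:
    """Discrete PI controller: response-time error in, CPU cores out.

    All names and default values here are hypothetical, for illustration only.
    """
    set_point: float        # target average response time in seconds (the SLA)
    kp: float               # proportional gain (assumed value)
    ki: float               # integral gain (assumed value)
    min_cores: float = 0.1  # smallest allocation the controller may request
    max_cores: float = 8.0  # largest allocation the controller may request
    integral: float = 0.0   # integral state, expressed in cores

    def next_allocation(self, measured_rt: float) -> float:
        # A positive error means the application is slower than the SLA allows,
        # so the controller should grant more CPU.
        error = measured_rt - self.set_point
        candidate = self.integral + self.ki * error
        cores = self.kp * error + candidate
        clamped = min(self.max_cores, max(self.min_cores, cores))
        # Anti-windup: update the integral state only when the output
        # is not saturated at one of the limits.
        if cores == clamped:
            self.integral = candidate
        return clamped


def simulate(steps: int = 100) -> float:
    """Run the loop against a toy plant and return the final response time."""
    work_per_request = 0.4  # core-seconds per request (assumed plant model)
    cores = 1.0             # initial allocation
    controller = PIController(set_point=0.2, kp=1.0, ki=2.0, integral=cores)
    response_time = work_per_request / cores
    for _ in range(steps):
        # Measure, compute the new allocation, then observe the effect.
        cores = controller.next_allocation(response_time)
        response_time = work_per_request / cores
    return response_time


if __name__ == "__main__":
    print(f"final average response time: {simulate():.3f} s")
```

In a container-based deployment, the value returned by `next_allocation` could be applied as the container's CPU limit at each control period, with the measured response time averaged over that period. Which mechanism and which controller the actual system uses is not stated here.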