Improving reservation-based scheduling tools in Hadoop

Over the last decade, data and their use have become fundamental to a new type of economy based on mining users' data. Companies try to collect as much data as possible in order to provide services and help users in their everyday activities. The challenge is to store these data and mine them as quickly as possible, so that responses keep pace with today's busy life. Hadoop is a framework that processes large data sets across clusters of computers using simple programming models. It uses the Hadoop Distributed File System (HDFS) to abstract the physical storage of data, which are spread across different parts of the cluster. On top of HDFS sits YARN, a resource negotiator that manages all the resources in the cluster and allocates computational resources to the jobs that users submit. Jobs that run on Hadoop can be divided into production jobs and best-effort jobs, which have different types of SLAs and requirements. The goal of the Reservation Scheduler is to make predictable resource allocations that satisfy production job SLAs, to minimize best-effort job latency, and to achieve high cluster utilization. The topic of this thesis is the implementation of two features that support the reservation system in Hadoop. The first is a graphical interface that shows the plan of the resources allocated in the cluster. The second feature concerns GridMix and its use to test job workloads with resource reservations.

Since the last decade, the data supplied by users have been fundamental to today's economy, which is based on finding patterns within those data. Companies try to collect as much data as possible in order to provide services that help users in everyday life. The new challenge is to store all these data and to find the information hidden in them, so as to provide answers in the shortest possible time, suited to the hectic pace of daily life. Hadoop is a framework that processes large data sets spread across a cluster through a simple programming model. Hadoop provides a file system abstraction that allows data to be stored in a distributed way within the cluster. Above this layer is YARN, a dynamic negotiator that manages all the computational resources available within the server. The jobs that can be submitted to Hadoop have different time and resource requirements. The reservation-based scheduler tries to provide a predictable allocation that satisfies the SLAs of production jobs, minimizes the latency of best-effort jobs, and achieves the highest utilization of all the cluster's resources. The main contribution of this thesis is the implementation of two features that support the reservation-based scheduler. The first is a graphical interface for visualizing the resource allocation plan, while the second concerns GridMix and its use for testing workloads with reservations.