Creating a simulation framework for the Hadoop Distributed File System in heterogeneous cloud environments using CloudSim

With the rapid advance of computer technology in recent years, the paradigm of cloud computing has established itself. Cloud computing is the on-demand availability of computer system resources, mainly (but not only) computing power and data storage. Cloud computing relies on large data centers, and the largest cloud infrastructures are distributed over multiple data centers in different locations. Users interact with these data centers through the Internet, and it is in exactly this kind of environment that architectures like HDFS™ are deployed. HDFS is a distributed file system and one of the modules of the Apache™ Hadoop® project. A distributed file system spans multiple machines and is meant to be accessed by many users at the same time. It exploits the disks of the machines connected to the same network, creating a common resource pool. Users can access the file system and perform operations on data without needing to know what happens in the background inside the cloud infrastructure. HDFS is currently one of the most popular distributed file systems, and while it shares many similarities with the others, some key aspects set it apart: HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. At the same time, HDFS provides high-throughput access to data and is suitable for applications with large data sets. The objective of this thesis is to model the behavior of HDFS inside CloudSim, an easily extensible simulation framework developed by the Cloud Computing and Distributed Systems (CLOUDS) Laboratory at the University of Melbourne. CloudSim makes it possible to model and simulate cloud computing infrastructures and services and to estimate their performance.
CloudSim allows us to configure a network of one or more data centers, each consisting of one or more hosts, and to run virtual machines (VMs) on these hosts. Various configuration parameters can then be adjusted, and simulated workloads, called cloudlets, can be executed inside the VMs in order to evaluate their performance. However, my study of CloudSim showed that the current version of the framework cannot estimate the performance of file transfers over the network, and that it cannot accurately estimate the performance of disks when multiple operations are performed on them at the same time (concurrency). My work therefore extends CloudSim to simulate the behavior of HDFS inside a network of machines, creating a new framework named CloudSim-HDFS. It follows the design philosophy of CloudSim as closely as possible and extends the existing code without altering any existing functionality. If, in the future, the CloudSim framework is updated to accurately estimate the performance of network transfers and disk operations, the work done in this thesis can be used not only to model HDFS behavior in cloud environments, but also to estimate its performance directly inside CloudSim with reasonable accuracy. To demonstrate the functionality implemented in my work, I will present and discuss a comprehensive scenario covering all the key aspects of a simulation of proper HDFS behavior.
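The hierarchy described above (data centers containing hosts, hosts running VMs, VMs executing cloudlets) can be pictured with a minimal, self-contained Java sketch. All class and method names below are hypothetical and chosen only for illustration; they are not the CloudSim API. The run-time estimate assumes a cloudlet whose length is given in millions of instructions (MI) running alone on a VM whose capacity is given in MIPS, and it deliberately ignores scheduling, network transfers, and disk contention.

```java
import java.util.Arrays;
import java.util.List;

/** Illustrative sketch only: these types are hypothetical, NOT CloudSim classes. */
public class CloudHierarchySketch {

    /** A simulated workload; length is in millions of instructions (MI). */
    static final class Cloudlet {
        final String name;
        final long lengthMi;
        Cloudlet(String name, long lengthMi) { this.name = name; this.lengthMi = lengthMi; }
    }

    /** A virtual machine with a processing capacity in MIPS and its assigned cloudlets. */
    static final class Vm {
        final String name;
        final double mips;
        final List<Cloudlet> cloudlets;
        Vm(String name, double mips, List<Cloudlet> cloudlets) {
            this.name = name; this.mips = mips; this.cloudlets = cloudlets;
        }
    }

    /** A physical machine that runs one or more VMs. */
    static final class Host {
        final String name;
        final List<Vm> vms;
        Host(String name, List<Vm> vms) { this.name = name; this.vms = vms; }
    }

    /** A data center made of one or more hosts. */
    static final class Datacenter {
        final String name;
        final List<Host> hosts;
        Datacenter(String name, List<Host> hosts) { this.name = name; this.hosts = hosts; }
    }

    /** Idealized run time in seconds of one cloudlet running alone on a VM (assumption). */
    static double runSeconds(Cloudlet cloudlet, Vm vm) {
        return cloudlet.lengthMi / vm.mips;
    }

    /** Walks the whole hierarchy and counts the cloudlets assigned to any VM. */
    static int countCloudlets(List<Datacenter> datacenters) {
        int total = 0;
        for (Datacenter dc : datacenters) {
            for (Host host : dc.hosts) {
                for (Vm vm : host.vms) {
                    total += vm.cloudlets.size();
                }
            }
        }
        return total;
    }

    public static void main(String[] args) {
        Cloudlet small = new Cloudlet("small", 10_000);
        Cloudlet large = new Cloudlet("large", 40_000);
        Vm vm = new Vm("vm0", 1_000.0, Arrays.asList(small, large));
        Host host = new Host("host0", Arrays.asList(vm));
        Datacenter dc = new Datacenter("dc0", Arrays.asList(host));

        System.out.println("Cloudlets in the network: " + countCloudlets(Arrays.asList(dc)));
        for (Cloudlet c : vm.cloudlets) {
            System.out.println(c.name + " would take about " + runSeconds(c, vm) + " s on " + vm.name);
        }
    }
}
```

The estimate covers processing time only, which is the part the abstract says CloudSim already evaluates. File transfers over the network and concurrent disk operations, which the abstract identifies as gaps, are left out of the sketch.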

With the rapid technological growth of recent years, the paradigm of cloud computing has established itself: the on-demand availability of computing resources over the network, mainly for computing power or data storage. Cloud computing refers to large data centers, and the largest clouds are distributed over several data centers in separate locations. Users can reach them over the network, and it is precisely on these data centers that architectures such as HDFS™, a distributed file system and one of the modules of the Apache™ Hadoop® project, are deployed. A distributed file system is a file system that spans multiple nodes (machines) and is designed to be used by several users at the same time. It exploits the disks of the many nodes connected to the same network, creating a single large pool of shared resources. Users can access the file system and perform operations on data without needing to be aware of what happens in the background in the cloud infrastructure. HDFS is currently one of the most popular distributed file systems, and although it shares many similarities with the others, some significant aspects make it unique: HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. At the same time, HDFS provides high-throughput access to data and is designed for applications that operate on large data sets. The objective of this thesis is to model the behavior of HDFS inside CloudSim, an easily extensible framework developed by the Cloud Computing and Distributed Systems (CLOUDS) Laboratory at the University of Melbourne. CloudSim makes it possible to model and simulate cloud computing infrastructures and services and to estimate their performance.
CloudSim makes it possible to configure a network of one or more data centers, composed of hosts on which virtual machines (VMs) run. Various configuration parameters can be adjusted as desired, and simulated workloads, called cloudlets, can then be executed inside the VMs in order to evaluate their performance. My study of CloudSim showed, however, that the current version of the framework cannot evaluate the performance of data transfers over the network, and cannot correctly evaluate the performance of disks when several operations are performed on them at the same time (concurrency). My work therefore extends CloudSim to simulate the behavior of HDFS inside a network of machines, creating a new framework called CloudSim-HDFS. This work follows the design philosophy of CloudSim as faithfully as possible, extending the existing code without altering any of the functionality already present. In the future, when the framework is updated so that network and disk performance are estimated accurately, this work can be used not only to model the behavior of HDFS in cloud environments, but also to estimate its performance directly inside CloudSim with good accuracy. To show the functionality introduced by my work, I will present and discuss a complete and thorough scenario that demonstrates all the key aspects of a simulation of correct HDFS behavior.