A Comparison of Big Data Processing Frameworks

This work addresses big data by comparing the performance and functionality of several technologies designed to extract useful knowledge from large, unstructured data. It proceeds in three phases: installing and configuring an environment for storing and processing big data; analysing a dataset and planning the analyses to run on it; and implementing and executing those analyses on a cluster of three networked machines. The technologies selected are Map-Reduce, Hive and Spark, and each is used to run three analyses on a dataset provided by PAM containing sales data from its stores. The analyses compare the commercial performance of two product categories sold in the PAM supermarket chain: fresh fish products and frozen fish products. The results show that fresh fish products perform better and, from a technological standpoint, that Spark achieved the lowest execution times. Testing also showed, however, that Spark runs into problems when the intermediate dataset produced during processing does not fit entirely in main memory. As for Hive and Map-Reduce, Hive proved the simplest to use, but the complexity of Map-Reduce lets the programmer introduce optimizations that yield shorter response times than Hive. In conclusion, Spark needs substantial resources to perform at its best and is not optimal on clusters where several big data processing frameworks coexist, whereas Map-Reduce is rather complex to use but can deliver good results even on clusters built from mid- to low-end hardware.