Review of performance evaluation benchmarks of Apache Hadoop

PUSTINA, BLENDI

2013/2014

Abstract

With Internet usage and data volumes growing steadily, big data has become an important and challenging problem for data centers. Many platforms and frameworks aim to address it. Apache Hadoop is a software framework for storing and processing big data on clusters, providing reliable, scalable distributed computing. Hadoop includes a distributed file system for storing vast amounts of data across a cluster, and uses the MapReduce programming model to process large data sets by parallelizing the workload and the storage. Unlike relational database systems, Hadoop works well with unstructured data. Our work focuses on the performance evaluation benchmarks for Hadoop, which are crucial for testing cluster infrastructure. Given the sensitivity and importance of the data, clusters and distributed systems must be tested before deployment. Benchmark results can guide the tuning of configuration parameters to improve cluster performance. This thesis covers the necessary background on Hadoop and gives a comprehensive list of the benchmarks used to test it, with detailed information on their application and the procedures for running them. We ran the benchmarks in a virtual environment with different parameters and options, and the results led to the conclusions of this thesis.

Advisor: GRIBAUDO, MARCO
School / Dept.: ING - School of Industrial and Information Engineering (Scuola di Ingegneria Industriale e dell'Informazione)
Date: 24-Jul-2014
Academic year: 2013/2014
Document type: Master's thesis (Tesi di laurea Magistrale)
Appears in types: Master's thesis (Tesi di laurea Magistrale)

Attached files

File	Size	Format
PUSTINA_749598_ReviewOfPerformanceEvaluationBenchmarksHadoop.pdf (Thesis text; Open Access from 11/07/2015)	1.55 MB	Adobe PDF

Documents in POLITesi are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/10589/93418