Feature selection and machine learning models for big data performance evaluation
SAPLIK, DEMET SUDE
2017/2018
Abstract
During the last few decades, Big Data has become a key trend in many areas. As its popularity has grown, many big data tools have emerged to meet the requirements of Big Data applications. Apache Spark is one of these tools and compares favorably with the alternatives from many perspectives. Performance evaluation techniques play an important role in analyzing the performance of big data tools: they help to understand performance capabilities and guide application development, among other uses. This thesis applies different feature selection techniques together with regression models to predict the execution times of Spark applications. The analyses are performed on data from the Spark Deep Learning Pipelines framework. The Spark benchmarking suite is extended to support deep learning methods within the benchmark, and the data for the experiments are generated with this extension. A machine learning library is implemented in order to apply the ML techniques. In addition to the data from the Spark benchmarking suite, the experiments are also performed on data created with the TPC-DS framework.

File | Size | Format
---|---|---
2018_10_Saplik_Demet_Sude.pdf (not accessible; description: Thesis text) | 11.91 MB | Adobe PDF
Documents in POLITesi are protected by copyright and all rights are reserved, unless otherwise indicated.
https://hdl.handle.net/10589/142907