Parallel Implementations of the Skyline Query using PySpark

POLITesi - Digital archive of degree and doctoral theses


Pinari, Etion

2021/2022

Abstract

The Skyline operator, first proposed in 2001, is an extension of the SQL language that returns a set of interesting points from a given dataset. Since then, its high time complexity and widespread use have motivated the search for parallel solutions that compute the skyline of a dataset in as little time as possible. Many implementations have been proposed, mostly in the Map-Reduce framework, but few of them have been ported to the Spark framework. We therefore port some of these solutions to PySpark and propose several optimizations that significantly reduce query time on large datasets, the most effective being the “Representative filtering” method.
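As an illustration of what the operator computes, here is a minimal sketch in plain Python rather than PySpark. It is not the thesis's implementation and does not include the “Representative filtering” optimization, which the abstract does not describe. It assumes the usual dominance definition with smaller values preferred in every dimension. The function names (`dominates`, `skyline`, `partitioned_skyline`) and the round-robin partitioning are hypothetical choices made only for this example.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def dominates(a: Point, b: Point) -> bool:
    """Return True if a dominates b.

    Assumption: smaller is better in every dimension. a dominates b when it
    is no worse everywhere and strictly better in at least one dimension.
    """
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def skyline(points: Sequence[Point]) -> List[Point]:
    """Sequential skyline: keep a window of points not dominated so far."""
    window: List[Point] = []
    for p in points:
        if any(dominates(w, p) for w in window):
            continue  # p is dominated, so it cannot be in the skyline
        # p survives: drop any window points that p dominates, then keep p
        window = [w for w in window if not dominates(p, w)]
        window.append(p)
    return window


def partitioned_skyline(points: Sequence[Point], num_partitions: int) -> List[Point]:
    """Two-phase pattern in the style of Map-Reduce approaches.

    Phase 1 computes a local skyline per partition (independent work that
    could run in parallel); phase 2 merges the local results and computes
    the skyline once more. Partitioning here is simple round-robin.
    """
    partitions = [points[i::num_partitions] for i in range(num_partitions)]
    local_skylines = [skyline(part) for part in partitions]
    merged = [p for local in local_skylines for p in local]
    return skyline(merged)


if __name__ == "__main__":
    data = [(1, 9), (2, 7), (3, 8), (4, 4), (5, 5), (6, 2), (7, 6), (9, 1), (8, 8)]
    print(sorted(partitioned_skyline(data, num_partitions=3)))
```

The two-phase version returns the same set as the sequential one because every point of the global skyline is also undominated within its own partition, so it always survives the first phase.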


	Advisor
	
				MARTINENGHI, DAVIDE
			
	School / Dept.
	
				ING - School of Industrial and Information Engineering
			
	Date
	
				20 Dec 2022
			
	Academic year
	
				2021/2022
			
	Abstract (translated from Italian)
	
				The Skyline query, first proposed in 2001, is an extension of the SQL language that returns a set of interesting points from a body of data. The high time complexity of the algorithm has made it interesting to search for algorithms that compute the Skyline in parallel in the least possible time. Several parallel implementations have been suggested using the Map-Reduce framework, but the use of the Spark framework has almost never been explored. We ported several solutions to PySpark in order to compare them with the different optimization methods that can be used to reduce the algorithm's execution time, the most important of which is called “Representative filtering”.
			
	Appears in types:
	
				Master's degree thesis

Attached files

File	Size	Format
Etion Pinari Thesis 2021-2022 - ver 2.4.pdf (openly accessible on the internet)	2.05 MB	Adobe PDF
Extended Summary.pdf (openly accessible on the internet)	470.27 kB	Adobe PDF

Documents in POLITesi are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/10589/201822