Dynamic query optimization in Spark

Given the central role data has taken on in recent decades, and the growth in both its volume and velocity, query optimization on distributed massively parallel processing systems has become an active research topic. Among distributed query engines, Spark is one of the most relevant: originally introduced as an abstraction layer over MapReduce, it became one of the most widely used distributed SQL query engines after the introduction of the Spark SQL module. In Spark SQL, query optimization is handled by the Catalyst optimizer, which applies a two-step procedure. First, the tree representing the query is optimized in a rule-based fashion. The optimized logical plan then goes through a cost-based optimization phase, mainly responsible for selecting join strategies. Catalyst's optimizations, however, are purely static: any change in the data that occurs during the execution of a query cannot be taken into account. To bridge this gap, Adaptive Query Execution (AQE) was introduced. AQE offers three main dynamic optimizations: dynamic join strategy selection, automatic post-shuffle partition coalescing, and skewed join handling. Despite the optimization opportunities these features open up, little information is available on AQE's inner workings, and hence on the actions AQE takes when optimizing a workload. The goal of this thesis was to explore those inner workings. This was achieved through two sets of tests: the baseline set and the workload set. The baseline set used purpose-built data and workloads to study each of AQE's capabilities in isolation and understand how it operates. Once this first set of experiments had provided a solid understanding of AQE, the workload set explored how AQE interacts with production workloads.
During these experiments, AQE's capabilities were studied in different combinations and with different settings. Both AQE's effect on Spark's efficiency and the way it interacts with and modifies the standard query execution plan were taken into account. The results of the baseline tests, combined with the findings from the tests on the two production workloads, provide insight into how AQE behaves, enabling users to better understand AQE and make more effective use of it.

Considering the increasingly relevant role that data has taken on in recent decades, and the growth in the volume and velocity of its production, query optimization on distributed systems has become a matter of great importance. Spark stands out among the most widely used distributed query systems: originally created as an abstraction layer over MapReduce, it became one of the most widely used tools for distributed SQL queries with the introduction of the Spark SQL module. In Spark SQL, query optimization is the task of the Catalyst optimizer, which applies two layers of optimization: it first performs a rule-based optimization, and then a cost-based optimization aimed mainly at establishing the least expensive strategy for executing joins. The optimizations applied by Catalyst are, however, static and do not account for variations in the data that may occur during query execution. To overcome this shortcoming, Adaptive Query Execution (AQE) was introduced, adding dynamic optimization capabilities to Catalyst. In Spark 3.1.2 AQE has three capabilities: dynamically changing the strategy chosen for joins, automatically coalescing partitions after a shuffle phase, and handling joins that exhibit skew. Despite the optimization opportunities these features offer, information about how AQE works is scarce, which makes it difficult to analyze why AQE acts on a workload in a given way. This thesis set out to explore the inner workings of AQE, which was done through two groups of tests. The first group isolated AQE's capabilities, testing them individually on workloads and data built specifically to trigger the feature of interest.
Once AQE's inner workings were understood, the focus shifted to testing its interaction with workloads currently used in production. During these tests, several of AQE's capabilities were exercised in multiple combinations and with different parameter configurations. The results of the workload tests, combined with those obtained in the first testing phase, gave a clearer picture of how AQE works, making it possible to assess its impact in depth and, consequently, to use it more effectively.
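
As a rough illustration of how the three AQE capabilities can be switched on and off in different combinations, the sketch below collects the Spark SQL properties that govern them. The property names come from the Spark SQL configuration reference, not from this thesis. The helper `aqe_settings`, its parameters, and the mapping of join strategy selection to the broadcast threshold are assumptions made for illustration only.

```python
def aqe_settings(coalesce=True, skew_join=True,
                 broadcast_threshold_bytes=10 * 1024 * 1024):
    """Return Spark SQL properties for one combination of AQE features.

    Hypothetical helper: only the property names are Spark's own.
    Values are strings, the form in which Spark configuration is usually set.
    """
    def flag(enabled):
        return "true" if enabled else "false"

    return {
        # Master switch for Adaptive Query Execution.
        "spark.sql.adaptive.enabled": "true",
        # Automatic post-shuffle partition coalescing.
        "spark.sql.adaptive.coalescePartitions.enabled": flag(coalesce),
        # Skewed join handling.
        "spark.sql.adaptive.skewJoin.enabled": flag(skew_join),
        # Size limit under which a join side may be broadcast.
        # Assumption: dynamic join strategy selection compares runtime
        # sizes against this threshold instead of using its own switch.
        "spark.sql.autoBroadcastJoinThreshold": str(broadcast_threshold_bytes),
    }


if __name__ == "__main__":
    # Example: isolate skew handling by turning coalescing off.
    for key, value in aqe_settings(coalesce=False).items():
        print(f"{key}={value}")
```

With PySpark available, each key and value could be passed to `SparkSession.builder.config(key, value)` before the session is created. Returning a plain dictionary keeps the sketch independent of a Spark installation.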