Comparing two approaches for distributed SPARQL querying : vertex-centric graph exploration and edge-centric relational joins

The Resource Description Framework (RDF) provides a way to represent knowledge in graph form. SPARQL is the language used to query such data;by finding all possible matches of a given query graph over the RDF graph. Over the years;the size of RDF datasets on the web has grown to millions of entities connected by billions of edges, prompting for the development of a large number of distributed RDF stores and query engines to handle very large graphs. These systems differ in many aspects: from storage strategies, to indexing, communication, partitioning and join processing. In this landcape, we identify a trend to use one of two query processing approaches: relational joins and graph exploration. The former tends to decompose the query in individual edges and uses typical RDBMS operators to join their individual results. Therefore we call it an edge-centric approach. The latter aims at better exploiting the graph nature of RDF data. Queries are processed by visiting the vertices of the RDF graph following the shape of the query graph, and collecting results along the way via message passing. This qualifies it as a vertex-centric approach. This thesis compares these two approaches and investigates how other design decisions -- such as indexes, query plan shape and pruning strategies -- positively or negatively influence them, with particular focus on a distributed setting. For a fair comparison, we concentrate on two ad-hoc implementations -- also described here -- that share the same technological stack and include typical aspects of each paradigm, but we also include three state of the art engines that implement different ideas. Our analysis shows that graph-based methods better suit the unstructured nature of RDF data and the distributed setting. We also find out that the success of some relational-based engines is largely due to their ability to shift away from some typical assumptions of relational query processing and adapt to the peculiarities of RDF.

Il Resource Description Framework (RDF) fornisce un modello a grafo per rappresentare la conoscenza. SPARQL è il linguaggio utilizzato per interrogare RDF, fornendo tutte le possibili sovrapposizioni del grafo di query su quello dei dati. Nel corso degli anni i dataset RDF sono arrivati ad includere milioni di vertici collegati da miliardi di archi. Questo ha portato allo sviluppo di un grande numero di sistemi distribuiti atti a memorizzare ed interrogare enormi grafi. Essi differiscono in molti aspetti, tra cui memorizzazione, indicizzazione, comunicazione, partizionamento e processamento delle join. In questo panorama, identifichiamo principalmente due approcci di elaborazione delle query: join relazionali ed esplorazione del grafo. Il primo tende a decomporre la query in singoli archi e usa operatori tipici dei RDBMS, per questo lo definiamo edge-centric. Il secondo è orientato a sfruttare meglio la natura dei dati RDF. Le query vengono proccessate visitando i vertici del grafo RDF seguendo la struttra imposta dal grafo della query, e raccogliendo i risultati lungo il cammino tramite scambio di messaggi. Questo lo rende un metodo vertex-centric. Questa tesi compara i due approcci e analizza come altre scelte di progettazione -- come indici e struttura dei piani di query -- possano influenzarli positivamente o negativamente, con particolare attenzione ad un contesto distribuito. Per rendere la comparazione equa, si concentra principalmente su due implementazioni ad-hoch -- anch'esse qui descritte -- che presentano aspetti tipici di ciascun paradigma, ma condividono la stessa tecnologia sottostante. A questi affianchiamo anche tre sistemi allo stato dell'arte, che differiscono in alcuni aspetti. La nostra analisi mostra che i metodi basati su grafi si adattano meglio sia alla natura non strutturata dei dati RDF che al contesto distribuito. Scopriamo anche che il successo di alcuni sistemi basati sull'approccio relazionale è spesso dovuto al discostamento da alcune assunzioni tipiche dei database relazionali, per poi adattarsi al modello RDF.