A distributed and scalable vertex-centric SPARQL engine: design, implementation and optimization

The Semantic Web aims to make data available on the web machine-readable, ideally under a unified standard. Data are expressed as knowledge graphs, where nodes represent entities and edges represent relationships between entities. In practice, RDF is the standard used to build the graph, while SPARQL is the language used to query it. The growing adoption of these technologies has led to the rise of many engines able to store and query such datasets, usually with a centralized architecture. However, real-world RDF graphs can grow very large, with billions of entities and edges, which makes them difficult for a single machine to handle. A distributed system, by contrast, addresses this need through its ability to scale easily. The purpose of this work is to present VERNE, a distributed SPARQL engine based on the Pregel vertex-centric architecture, with a particular focus on the custom optimizations introduced. These optimizations focus on indexing, a key aspect of fast access to RDF data, and on the query resolution strategy, which can be optimized by analyzing the query graph and identifying the best order in which to evaluate triple patterns. The results collected show that the presented solution is scalable, an important feature for distributed systems, efficient in terms of query resolution time, and competitive with other established engines in this field. Because VERNE is an in-memory system, we also analyzed dataset loading performance, where it is very efficient.

The Semantic Web aims to make the data available on the web readable by computer systems, ideally using a single standard. Data are expressed in the form of knowledge graphs, where nodes represent entities and edges represent the relationships between entities. In practice, RDF is the standard used to build the graph, while SPARQL is the language used to query it. The growing adoption of these technologies has led to the emergence of many systems able to store and query such datasets, usually with a centralized architecture. However, RDF graphs in a real-world scenario can reach very large sizes, with billions of entities and edges, which makes it difficult to handle the workload on a single server. A distributed system, by contrast, meets this need because it can scale easily. The purpose of this work is to present a distributed SPARQL system based on a Pregel vertex-centric architecture, with particular attention to the specific optimizations introduced. These focus on indexing, a key aspect of fast access to RDF data, and on the optimal query resolution strategy, which can be obtained by analyzing the query graph and identifying the best order in which to evaluate the triple patterns. The results collected show that the presented solution is scalable, an important characteristic for distributed systems, efficient in terms of query resolution time, and also competitive with other engines already established in this field. Because the system loads data into main memory, we also analyzed dataset loading performance, where VERNE is very efficient.