Towards a top-k SPARQL query benchmark generator

Top-k queries which returning the top k results ordered according to a user-defined scoring function are gaining more and more attention in the Database and Semantic Web communities. Order is an important property that can be exploited to speed up query processing. Also, state-of-theart SPARQL engines typically do no exploit order for query optimization purposes: Top-k queries are mostly managed with a materialize-then-sort processing scheme that computes all the matching solutions (e.g. thousands) even if only a limited number k (e.g. ten) are requested. Recent works have shown that an efficient split-and-interleave processing scheme could be adopted to improve the performance of top-k SPARQL queries. A consistent comparison of those works does not exist in current literature. As often occurs, the main cause for this fragmentation resides in the lack of a SPARQL benchmark covering top-k SPARQL queries. To foster the work on top-k query processing within the Semantic Web community, we believe that it is the right time to define a top-k SPARQL benchmark. For the benchmark to be meaningful, at least two requirements should hold: 1) the benchmark should resemble reality as much as possible, and 2) it should stress the features that distinguish top-k queries from traditional SPARQL queries both from a syntactic perspective, i.e., queries should contain rank-related clauses, and from a performance perspective, i.e. the query mix should insist on different characteristics of the queried data, which are central to top-k query answering, so to stress the evaluated systems in several running conditions. Recent works on SPARQL query benchmarks help satisfying the first requirement. In this thesis, I investigate the second requirement, by extending DBpedia SPARQL Benchmark (DBPSB) [Morsey et al., 2011] with the capabilities required to compare SPARQL engines on top-k queries. The Top-k DBpedia SPARQL Benchmark (Top-k DBPSB) proposed in this thesis uses the same dataset, performance metrics, and test driver of DBPSB. The innovative part of my work consists in an algorithm to create the Top-k queries from the Auxiliary queries of the DBPSB and its datasets. To generate top-k queries, my algorithm needs to have rankable variable that can be used in the scoring function part of the Top-k queries and some statistical information about the dataset. To this end, I developed an approach that explores the dataset collecting meaningful rankable variables and statistics. Once all the required information is available, it is possible to generate top-k queries for each DBPSB Auxiliary queries. My algorithm randomly generates the scoring function and adds ORDERBY and LIMIT clause to complete the Top-k queries. We have four research hypothesis: 1) the more rankable objects, the more execution time, 2) the more rankable predicates, the more execution time, 3) the more selectivity, the less execution time, and 4) the more the limit value, the same execution time (except ARQ Rank).We run the Top-K DBPSB against different four SPARQL Engines. The results of the extensive experimental evaluation confirm hypothesis 2) and 4) but do not confute the other two assumptions 1) and 3), which require further research.

Le query top-k query che ritornano i k migliori risultati ordinati secondo una funzionedefinita dall’utente stanno ottenendo sempre più attenzione nelle comunità delle basi di dati e del SemanticWeb. L’ordine è una proprietà importante che può essere sfruttata per accelerare l’elaborazione delle query, ma, allo stato dell’arte, i motori SPARQL in genere non usano l’ordine allo scopo di ottimizzarel’elaborazione dellequery: le query top-k sono per lo più elaborateprima materializzando i risultati e poi ordinandoli (materialize-then-sort). In questo modo il motore SPARQL prima calcola tutte le soluzioni corrispondenti (ad esempio migliaia) anche se ne sono richieste solo un numero k limitato (ad esempio dieci). Lavori recenti hanno proposto un approccio differente alla valutazione di query top-k:split-and-interleave. Questo metodo prima divide la funzione di ordinamento in parti e poi fa in modo di valutarla via via che vengono trovati i risultati e non solo alla fine.Questo metodo si è dimostrato in grado di migliorare le prestazioni delle query top- k sia nei database che nel Semantic Web. Nella letteratura corrente non esiste, però, un lavoro di valutazione comparativa di motori SPARQL per quel che riguarda le query top-k. Come spesso accade, la causa principale di questa frammentazione risiede nella mancanza di un benchmark. Per favorire lo sviluppo della ricerca sull’ottimizzazione diquery top-k in SPARQL, crediamo che sia giunto il momento di definire un benchmarkdi querySPARQL top- k. Un benchmark per essere significativo, dovrebbero avere almeno due requisiti: 1 ) dovrebbe essere simile, per quanto possibile, alla realtà e 2) dovrebbe insistere sulle caratteristiche che distinguono le query top- k dallequery SPARQL tradizionali sia dal punto di vista sintattico, vale a dire, le query devono contenere clausole di ordinamento, sia dal punto di vista delle prestazioni, ovvero il mix di query dovrebbe stressare i motori SPARQL là dove sono più deboli nel valutare le query top-k. In questa tesi ho investigato la seconda condizione, estendendo il DBpedia SPARQL Benchmark (DBPSB) [Morsey et al., 2011] con le capacità necessarie per confrontare motori SPARQL su query top-k. Il benchmark da me proposto usa lo stesso insieme di dati, gli stessi indicatori di prestazione, e gli stessi test driver di DBPSB. La parte innovativa del mio lavoro di tesi consiste in un algoritmo per creare query top-k a partire dalle query ausiliarie di DBPSB. Per generare le query top-k il mio algoritmo ha bisogno di conoscere un insieme di variabili rispetto a cui ordinare (rankable variabile) i risultati delle query query top-k e di alcune informazioni statistiche sull’insieme di dati. A questo scopo ho sviluppato un approccio che esplora la collezione di dati collezionando sia le statistiche che le rankable variabile. Una volta che l’informazione necessaria a generare le query top-k è disponibile, per ogni query ausiliaria di DBPSB il mio algoritmo genera in modo casuale la funzione di ordinamento e aggiunge alla query le clausole ORDER BY e LIMIT. L’algoritmo assume che: 1) maggiore sia il numero di oggetti diversi da ordinare, maggiore sia il tempo di esecuzione; 2) maggiori siano il numero di rankable variable, maggiore sia il tempo di esecuzione; 3) più alta sia la selettività dei predicati rispetto a cui si ordina, minore sia il tempo di esecuzione; e 4) la clausola limit sia ininfluente. Ho comparato con il benchmark proposto quattro motori SPARQL. La valutazione sperimentale conferma le ipotesi 2) e 4), ma non permette di dare una risposta definitiva rispetto alle ipotesi 1) e 3).