In recent Big Data applications, Data Variety and Data Velocity are becoming progressively more important. Big Data is often associated with the Data Volume dimension, which concerns the size of the data subject to analysis: as that size increases, new challenges arise. Data Variety concerns the heterogeneity of the input data: data may come from different sources that generate information from different, complex domains using different structures. Data Velocity, on the other hand, encompasses time-related requirements: applications may impose timeliness constraints and well-specified latency bounds on the output of the analysis. Comprehensive approaches able to tackle challenges along all three dimensions are still a subject of research and are often described as ideal models. In practice, approaches narrow down the set of requirements they address, usually resulting in trade-offs. For example, Data Stream Management Systems (DSMS) and Complex Event Processors (CEP) successfully tackle part of the challenges of Data Volume and Data Velocity, but they have proved limited in applications where Data Variety constraints are critical. The purpose of this thesis is to propose a new approach within this scenario. In particular, we start from high-level requirements and progressively narrow the scope of the problem in order to formulate a more specific research problem. The aim of the thesis is to provide a working system able to perform Stream Reasoning from a Graph Stream Processing point of view. To achieve this, we use Timely Dataflow, a framework written in Rust that has shown strong results in Graph Processing applications. Timely Dataflow introduces a new model of computation based on the Dataflow paradigm, oriented towards efficient data-driven computation. We use DynamiTE as a reference system for this research field and implement its tasks with Timely Dataflow.
In this thesis we present the design and implementation of the system, evaluate its performance, and compare the results against DynamiTE. The system provides highly generic interfaces to favor extensibility. This, together with the performance evaluation results, allows us to outline directions for further improving performance and extending the system's capabilities.
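To make the reasoning task concrete, the following is a toy sketch (not the thesis implementation, and independent of Timely Dataflow) of the kind of RDFS inference such a system materializes: the transitive closure of `rdfs:subClassOf` edges, computed as a naive fixpoint over a set of (sub, super) pairs. All names here are illustrative.

```rust
// Toy RDFS subclass reasoning: derive all (sub, super) pairs implied by
// the transitivity of rdfs:subClassOf, iterating to a fixpoint.
use std::collections::HashSet;

fn subclass_closure(edges: &[(&str, &str)]) -> HashSet<(String, String)> {
    let mut closure: HashSet<(String, String)> = edges
        .iter()
        .map(|&(a, b)| (a.to_string(), b.to_string()))
        .collect();
    // Repeat until no new pair can be derived.
    loop {
        let mut derived = Vec::new();
        for (a, b) in &closure {
            for (c, d) in &closure {
                // (a subClassOf b) and (b subClassOf d) imply (a subClassOf d).
                if b == c {
                    let pair = (a.clone(), d.clone());
                    if !closure.contains(&pair) {
                        derived.push(pair);
                    }
                }
            }
        }
        if derived.is_empty() {
            break;
        }
        closure.extend(derived);
    }
    closure
}

fn main() {
    let edges = [("Dog", "Mammal"), ("Mammal", "Animal")];
    let closure = subclass_closure(&edges);
    // Derived fact: Dog is a subclass of Animal.
    assert!(closure.contains(&("Dog".to_string(), "Animal".to_string())));
    println!("derived {} subclass pairs", closure.len()); // prints "derived 3 subclass pairs"
}
```

A dataflow formulation of the same rule replaces the explicit loop with an iterative operator over a stream of triples, which is what allows incremental maintenance as the graph changes.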
Leveraging timely and differential dataflow for efficient RDFS reasoning
HYPI, XHIMI
2019/2020
Documents in POLITesi are protected by copyright and all rights are reserved, unless otherwise indicated.
https://hdl.handle.net/10589/176231