On relevant query answering over streaming and distributed data

ZAHMATKESH, SHIMA

Abstract

Web applications that join streaming data with distributed data to provide relevant answers have attracted growing attention in recent years. Answering in a timely fashion, i.e., reactively, is one of the most important performance indicators for these applications. The Semantic Web community has shown that RDF Stream Processing (RSP) is an adequate framework for developing this type of application. However, remaining reactive can be challenging, especially when the distributed data is slowly evolving, because accessing the distributed data can be highly time-consuming as well as rate-limited.
State-of-the-art work addresses this problem with an architectural approach that keeps a local replica of the distributed data. The local replica progressively becomes stale if it is not updated to reflect the changes in the remote distributed data. For this reason, the RSP community has recently investigated maintenance policies for the local replica that guarantee reactiveness while maximizing the freshness of the replica. The investigated maintenance policies focus on a class of queries that join a data stream with a distributed data source.
This thesis goes beyond the state of the art by focusing on finding the most relevant answers when continuously answering queries over streaming and distributed data, while respecting the reactiveness constraints imposed by the users. The contributions of this study are various maintenance policies tailored to two classes of queries: i) queries that have to filter data in the distributed dataset before joining it with streaming data, and ii) top-k queries where the scoring function involves data that appears in both the streaming and the distributed datasets. These advanced policies let RSP engines continuously answer the two classes of queries described above. In particular, the proposed policies focus on refreshing only the data in the replica that contributes to the relevancy of the results.
For the class of queries that have to filter the distributed data, a new maintenance policy is proposed. Intuitively, the Filter Update Policy updates data that is likely to pass the filter condition and may therefore affect future evaluations. While the Filter Update Policy works for queries where the filter has high selectivity, other policies work better for low selectivity. To solve this problem, the second contribution is a rank aggregation algorithm that fairly considers the opinions of multiple policies simultaneously.
For the class of top-k queries, the contribution is an extended top-k query evaluation that considers the join of streaming data with the distributed dataset. Keeping a local replica of the distributed dataset, two maintenance policies are proposed to approximately answer the continuous top-k query. The experimental evaluations empirically demonstrate the ability of the proposed policies to guarantee reactiveness, while providing more accurate and relevant results than the state of the art.
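The idea of refreshing only the replica entries that matter for a filter, and of combining the opinions of several policies, can be illustrated with a small sketch. This is not the algorithm from the thesis, whose details the abstract does not give. It assumes a numeric filter threshold, a "distance to threshold" heuristic for the filter-aware ranking, a least-recently-refreshed ranking as the second policy, and a Borda count as the aggregation scheme. All names (`ReplicaEntry`, `choose_refresh_set`, and so on) are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ReplicaEntry:
    key: str
    value: float       # last value fetched from the remote source
    last_refresh: int  # logical time of the last refresh


def filter_aware_ranking(entries, threshold):
    """Rank keys by how close the cached value is to the filter threshold.

    Assumption: entries near the threshold are the most likely to move
    in or out of the filter, so refreshing them first is most useful.
    """
    ordered = sorted(entries, key=lambda e: (abs(e.value - threshold), e.key))
    return [e.key for e in ordered]


def staleness_ranking(entries):
    """Rank keys from least to most recently refreshed."""
    ordered = sorted(entries, key=lambda e: (e.last_refresh, e.key))
    return [e.key for e in ordered]


def borda_aggregate(rankings):
    """Combine rankings with a Borda count (an assumed aggregation scheme)."""
    scores = {}
    for ranking in rankings:
        n = len(ranking)
        for position, key in enumerate(ranking):
            scores[key] = scores.get(key, 0) + (n - position)
    # Higher score first; ties broken by key for determinism.
    return sorted(scores, key=lambda k: (-scores[k], k))


def choose_refresh_set(entries, threshold, budget):
    """Pick at most `budget` keys to refresh before the next evaluation."""
    rankings = [
        filter_aware_ranking(entries, threshold),
        staleness_ranking(entries),
    ]
    return borda_aggregate(rankings)[:budget]


replica = [
    ReplicaEntry("a", value=95.0, last_refresh=8),
    ReplicaEntry("b", value=101.0, last_refresh=2),
    ReplicaEntry("c", value=40.0, last_refresh=1),
    ReplicaEntry("d", value=180.0, last_refresh=9),
]
to_refresh = choose_refresh_set(replica, threshold=100.0, budget=2)
print(to_refresh)  # ['b', 'c']
```

The budget stands in for the reactiveness constraint: only as many remote accesses as fit in the time available are spent, and they go to the entries that the combined ranking places first.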

	Supervisor
	
				DELLA VALLE, EMANUELE
			
	Coordinator
	
				BONARINI, ANDREA
			
	Tutor
	
				CERI, STEFANO
			
	Date
	
				17-Jul-2018
			
	Italian abstract (translated)
	
				Applications that join data streams with data distributed on the Web have attracted growing attention in recent years. Responding in a timely manner (that is, being reactive) is the most important success indicator for these applications. The Semantic Web community has shown that RDF Stream Processing (RSP) is adequate for developing this type of application, but even for an RSP system remaining reactive can be difficult when the distributed data evolves slowly. This happens because accessing the distributed data can take a long time and the maximum rate of access to such data can be limited.
The RSP state of the art solves this problem by proposing an architectural approach that keeps a replica of the distributed data local to the RSP system. The local replica progressively becomes stale if it is not updated to reflect the changes made to the distributed data. For this reason, the RSP community has recently studied several maintenance policies for the local replica that guarantee reactiveness while maximizing the freshness of the replica. The maintenance policies investigated focus on a class of queries that joins data in a stream with data in a distributed source.
This thesis goes beyond the state of the art by focusing on queries that continuously search for the most important combinations of data present in both the stream and the distributed source. The contributions of this study are various maintenance policies for the local replica, covering two classes of queries: i) queries that filter the data in the distributed source before joining it with the data in the stream, and ii) top-k queries in which the ranking function involves data that appears in both the stream and the distributed data source.
The contribution of this doctoral thesis is a set of advanced maintenance policies that allow RSP systems to answer the two classes of queries described above reactively. Intuitively, the proposed policies succeed where the state of the art failed because they update only the data in the replica that contributes to identifying the most important results.
For the class of queries that must filter the distributed data, the thesis proposes a new maintenance policy that focuses on the data most likely to satisfy the filter conditions and that could therefore affect future evaluations. This policy works for queries in which the filter has high selectivity, but other policies work better when selectivity is low. To solve this problem, a second contribution of this thesis is an algorithm that aggregates the opinions of multiple policies.
As for the class of top-k queries, the contributions of the thesis are a new top-k algorithm that joins data streams with distributed data sources, and two local-replica maintenance policies optimized for top-k queries. The experimental evaluations empirically demonstrate the ability of the proposed policies to guarantee reactiveness while providing more accurate and relevant results than the state of the art.
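To make the top-k setting concrete, the following minimal sketch joins the items of a stream window with a local replica and ranks the joined results. It is only an illustration: the additive scoring function, the integer scores, and the names `top_k_join`, `window`, and `replica` are assumptions, and the sketch does not reproduce the top-k algorithm or the maintenance policies proposed in the thesis.

```python
import heapq


def top_k_join(window, replica, k):
    """Return the k best (score, key) pairs for the current window.

    window:  list of (key, stream_score) pairs observed in the stream
    replica: dict mapping key -> cached score from the distributed dataset
    The combined score is simply the sum of the two parts (an assumption).
    """
    scored = []
    for key, stream_score in window:
        if key in replica:  # join on key; unmatched stream items are skipped
            scored.append((stream_score + replica[key], key))
    return heapq.nlargest(k, scored)


window = [("alice", 30), ("bob", 10), ("carol", 25), ("dave", 50)]
replica = {"alice": 5, "bob": 40, "carol": 20}

stale_answer = top_k_join(window, replica, k=2)
print(stale_answer)  # [(50, 'bob'), (45, 'carol')]

# If the remote value for "alice" has changed to 30 and the replica is
# refreshed, the answer changes: a stale entry had hidden a top-k result.
fresh_replica = dict(replica, alice=30)
fresh_answer = top_k_join(window, fresh_replica, k=2)
print(fresh_answer)  # [(60, 'alice'), (50, 'bob')]
```

The second call shows why the choice of which replica entries to refresh matters for top-k queries: only entries that can enter or leave the top-k affect the answer, so refreshing the others spends the access budget without improving the result.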
			
	Document type
	
				Doctoral thesis
			
	Appears in types:
	
				Doctoral Thesis

Attached files

File	Size	Format
2018_07_PhD_Zahmatkesh.pdf (accessible online to everyone; description: Thesis text)	3.09 MB	Adobe PDF

Documents in POLITesi are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/10589/141258