Source authority by data ownership: an approach to reliable data fusion

The era of Big Data, the World Wide Web, and new pervasive and interactive technologies has led to an information ecosystem in which a huge number of players provide their own information, often about the same real-world entities. When integrating data from such heterogeneous sources, the challenge lies not only in the volume of information but also in the fact that it is usually impossible to monitor how each source procures its information, and hence to exclude from the integration data that was copied from another, already available source. The main purpose of the Data Fusion phase of Data Integration is to identify, within this noisy data where sources provide conflicting or partial information, the true values of each data item. In this context, Majority Voting, which selects the values proposed by the most sources, is ineffective because copying between sources is frequent, encouraged by the availability of data on the Web. Many state-of-the-art Data Fusion systems evaluate the trustworthiness of sources in terms of the veracity of the values they propose, and vice versa, iterating until these measures converge. These systems have been shown to be very effective but lack scalability, so their efficiency on large quantities of data is low. In this thesis we propose a new approach that aims to provide a more versatile algorithm that does not rely on an iterative process and is therefore more agile and scalable. The algorithm addresses the truth discovery task in both single-truth and multi-truth contexts by relying on the concept of "data ownership".
Assuming that the real-world owner of a given entity always provides the most reliable and copy-free information about it, we focus on identifying the "owner source" (if any) and rely only on the information it provides. This differentiates our solution from iterative approaches, which apply weighted voting on values based on the authority of their sources. Experiments on both single-truth and multi-truth real-world datasets show that our solution is the most efficient in returning the proposed truth, while showing similar, and in some cases better, performance.

The era of Big Data, the World Wide Web, and increasingly pervasive and interactive new technologies has created an information ecosystem containing a very large number of sources that provide information, often about the same entities. When we want to integrate information from heterogeneous sources, in addition to the problems raised by the large amount of available data, it is in most cases impossible to control how individual sources acquire their information, and hence to avoid including in the integration already available data that was copied from other sources. The main purpose of the Data Fusion phase within Data Integration is to identify, among all the sources that provide conflicting or partial information, the true values to assign to each object. In this context, selecting the most frequent value as the true one is often ineffective because copying between sources is frequent. Many Data Fusion systems use a strategy based on evaluating the reliability of sources, with an iterative approach to reach convergence of these measures. These systems have proven very effective, but they often lack the scalability to work with large quantities of data. In this thesis we propose a new approach aimed at providing a more versatile algorithm that is not based on an iterative process and is therefore faster and more scalable. The presented method can be used in both single-truth and multi-truth contexts.
Assuming that the real owner of a given entity is also the one who provides the most reliable and copy-free information, we decided to concentrate on identifying this "owner source" (if present) and to rely only on the information it presents, differentiating our solution from those that use an iterative approach and ultimately apply weighted voting to determine the true value to assign to a given object. The experiments performed, on both single-truth and multi-truth datasets, highlight the greater efficiency of our solution, which at the same time shows similar, if not better, precision.
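
The contrast between Majority Voting and the ownership-based idea described above can be illustrated with a minimal sketch. It is not the algorithm of the thesis: the claim format `(source, entity, value)`, the function names, the sample data, and the `owners` mapping are all hypothetical. In particular, the sketch assumes the owner source of each entity is already known, whereas identifying it is the core of the proposed approach, and it assumes a fallback to Majority Voting when no owner is present, which the abstract does not specify.

```python
from collections import Counter, defaultdict


def majority_vote(claims):
    """Return, for each entity, the single value proposed by the most sources."""
    votes = defaultdict(Counter)
    # dict.fromkeys drops repeated claims while keeping a deterministic order.
    for source, entity, value in dict.fromkeys(claims):
        votes[entity][value] += 1
    return {entity: counter.most_common(1)[0][0] for entity, counter in votes.items()}


def owner_based_fusion(claims, owners):
    """Return, for each entity, the set of values claimed by its owner source.

    `owners` maps an entity to its owner source and is assumed to be given.
    When an entity has no owner among its sources, the sketch falls back to
    Majority Voting (an assumption made only for this illustration).
    """
    by_entity = defaultdict(lambda: defaultdict(set))
    for source, entity, value in claims:
        by_entity[entity][source].add(value)

    fallback = majority_vote(claims)
    result = {}
    for entity, by_source in by_entity.items():
        owner = owners.get(entity)
        if owner in by_source:
            # Single pass, no iteration: keep every value the owner provides,
            # which covers both single-truth and multi-truth entities.
            result[entity] = set(by_source[owner])
        else:
            result[entity] = {fallback[entity]}
    return result


# Hypothetical claims: three aggregators copy a stale phone number.
claims = [
    ("aggregator-a", "hotel_x.phone", "555-0100"),
    ("aggregator-b", "hotel_x.phone", "555-0100"),
    ("aggregator-c", "hotel_x.phone", "555-0100"),
    ("hotel-x-site", "hotel_x.phone", "555-0199"),
    ("publisher-site", "book_y.authors", "A. Rossi"),
    ("publisher-site", "book_y.authors", "B. Bianchi"),
    ("aggregator-a", "book_y.authors", "A. Rossi"),
    ("aggregator-a", "city_z.population", "1000"),
    ("aggregator-b", "city_z.population", "1000"),
    ("aggregator-c", "city_z.population", "1200"),
]
owners = {"hotel_x.phone": "hotel-x-site", "book_y.authors": "publisher-site"}

voted = majority_vote(claims)
fused = owner_based_fusion(claims, owners)
```

In this toy data, Majority Voting selects the copied phone number and only one of the two authors, while the owner-based pass returns the values stated by the owner source in a single scan of the claims.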