Solving tuple level conflicts in data fusion exploiting data dependencies

Over the last few decades, the invention of the Internet and the emergence of the World Wide Web have made it possible and useful to access many information systems to obtain information. The role of databases and of the technologies associated with them has changed dramatically in the last 20 years. The world has moved from closed, centralized databases for record-keeping to open and interconnected sources of structured information. As the number of sources providing the same type of information grows, so does the need to give the user a fully integrated view of the world, which requires integrating data from multiple sources. Each source provides a set of values, and different sources often provide conflicting values. To present quality data to users, it is critical that data integration systems can resolve conflicts and discover the true values. This thesis aims to show how data dependencies, which have always been considered a hurdle to the integration process, can be a very useful tool for cleaning data and resolving tuple-level conflicts during integration. Data dependencies are constraints that define the relationship between attributes. They occur when information stored in a database uniquely determines other information stored in the same database. The running example is a medical database drawn from two different sources that share the same schema and the same data representation. The data provided by the two sources contain various data dependencies, which are highlighted, explained, and classified. Possible solutions are provided for resolving the conflicts and for handling the different types of dependencies encountered during the data cleaning and integration process. A conflict resolution architecture is presented that takes raw data from the sources and outputs a single clean, consistent, integrated database.
The entire architecture is based on data dependencies and is divided into three steps. The first step cleans the data provided by the sources by exploiting the data constraints. It takes care of noise and unwanted outlier values and ensures that wrong values are excluded from the data before the integration step. The second and most important step is integration, which consists of resolving the tuple-level conflicts and contradictions using the functional dependencies present in the data. The third and last step cleans the integrated database and finds the correct values for the null values left after the integration step. This last step resolves the remaining uncertainties by using the data already present in the database and by exploiting the patterns of semantically related constants captured by conditional and relaxed functional dependencies. Various solutions are presented for the different conflicts; they can be applied according to specific priorities, taking into consideration the authority and reliability of the sources. A prototype web application, “ConflictSolver”, has been created in which the user can first mine different types of data dependencies and then use them to execute the three phases of the architecture. The application lets users choose among the most cited and important algorithms currently available for automatically discovering functional, conditional, and relaxed dependencies.
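As an illustration of the idea that a functional dependency can both expose and validate the resolution of a tuple-level conflict, the following is a minimal sketch and not the ConflictSolver implementation. The function names `fd_violations` and `fuse`, the attribute names, the sample records, and the simple "prefer the more reliable source" policy are all assumptions introduced here for illustration.

```python
from collections import defaultdict


def fd_violations(rows, lhs, rhs):
    """Find violations of the functional dependency lhs -> rhs.

    Returns a dict mapping each combination of lhs values to the set of
    distinct non-null rhs values seen with it, keeping only the
    combinations associated with more than one rhs value.
    """
    seen = defaultdict(set)
    for row in rows:
        if row[rhs] is not None:
            seen[tuple(row[a] for a in lhs)].add(row[rhs])
    return {k: v for k, v in seen.items() if len(v) > 1}


def fuse(source_a, source_b, key, prefer="a"):
    """Merge two same-schema sources tuple by tuple on `key`.

    A null in one source is filled from the other; a genuine conflict is
    settled in favour of the preferred (assumed more reliable) source.
    """
    by_key_b = {row[key]: row for row in source_b}
    fused = []
    for row_a in source_a:
        row_b = by_key_b.pop(row_a[key], None)
        if row_b is None:
            fused.append(dict(row_a))
            continue
        first, second = (row_a, row_b) if prefer == "a" else (row_b, row_a)
        fused.append({
            attr: first[attr] if first[attr] is not None else second[attr]
            for attr in row_a
        })
    # Tuples that appear only in the second source are kept unchanged.
    fused.extend(dict(row) for row in by_key_b.values())
    return fused


# Hypothetical medical records; attribute names and values are illustrative.
source_a = [
    {"patient_id": 1, "drug": "amoxicillin", "drug_class": "antibiotic"},
    {"patient_id": 2, "drug": "ibuprofen", "drug_class": None},
]
source_b = [
    {"patient_id": 1, "drug": "amoxicillin", "drug_class": "analgesic"},
    {"patient_id": 2, "drug": "ibuprofen", "drug_class": "NSAID"},
]

# The assumed dependency drug -> drug_class flags the conflicting tuples.
conflicts = fd_violations(source_a + source_b, ["drug"], "drug_class")
integrated = fuse(source_a, source_b, key="patient_id", prefer="a")
```

Here the dependency `drug -> drug_class` flags the two sources' disagreement on amoxicillin, and the fused result satisfies the dependency again. A fixed source preference is only one possible policy; others, such as voting or weighting by source authority, would fit in the same place.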

In recent decades, the invention of the Internet and the emergence of the World Wide Web have made it possible and useful to access many information systems to obtain information. The role of databases and of the technologies associated with them has changed radically over the last 20 years. The world has moved from closed, central databases for record-keeping to open, interconnected sources. As the number of sources providing the same type of information grows, so does the need to give the user a fully integrated view of the world, which requires integrating data from multiple sources. Each source provides a set of values, and different sources can often provide conflicting values. To present quality data to users, it is essential that data integration systems can resolve conflicts and discover the true values. This thesis sets out to show how data dependencies, long considered an obstacle to the integration process, can be a very useful tool for cleaning data and resolving tuple-level conflicts during integration. Data dependencies are constraints that define the relationship between attributes. They occur when the information contained in a database uniquely determines other information contained in the same database. To this end, a medical database from two different sources with the same schema and the same data representation was used for the evaluation. The data provided by the two sources contain various data dependencies, which are highlighted, explained, and classified. Possible solutions are provided for resolving the conflicts and for handling the different types of dependencies encountered during the data cleaning and integration process.
A conflict resolution architecture is presented that takes the data from the sources and produces a single clean, consistent, integrated database. The entire architecture is based on data dependencies and is divided into three phases. The first phase cleans the data provided by the sources by exploiting the constraints on the data. It deals with unwanted outlier values and ensures that wrong values are excluded from the data before the integration phase. The second and most important phase is integration, which consists of resolving tuple-level conflicts using the functional dependencies present in the data. The last phase cleans the integrated database and finds the correct values for those left empty after the integration process. It resolves the remaining uncertainties by using the data already present in the database and by exploiting the patterns of semantically related constants captured by conditional and relaxed functional dependencies. Several solutions are presented for the different conflicts; they can be applied according to specific priorities, taking into account the authority and reliability of the sources. A prototype web application, “ConflictSolver”, has been created in which the user can first mine different types of data dependencies and then use them to run the three phases of the architecture. The application lets users choose among the most cited and important algorithms currently available for automatically discovering functional, conditional, and relaxed dependencies.