Entity resolution with query reverse engineering

With the advent of the Big Data era, driven in part by the spread of the internet and the intensive use of smart devices, a large amount of information is now available. Because these data have different origins and may reside in different kinds of sources (cloud, hybrid or on-premise), being able to integrate and combine them is an important and valuable task. Data Integration is the set of processes that combine data from multiple heterogeneous sources and unify them in a single view. Its three main steps are Schema Alignment, Entity Resolution and Data Fusion. Schema Alignment recognizes which attributes have the same meaning across the different source schemas. Entity Resolution, instead, identifies which data refer to the same entity. Finally, Data Fusion establishes the correct value for each entity when the sources disagree. In this thesis we focus on the second step; in particular, we use Query Reverse Engineering techniques to perform Entity Resolution automatically. Query Reverse Engineering denotes a set of techniques that, given a database and the result of a query, derive a query, or a set of instance-equivalent queries, that produces the same result. We first analyzed the strengths and weaknesses of Query Reverse Engineering by evaluating different architectures; we then used these techniques to build an algorithm that solves the Entity Resolution problem. Extensive experimental tests show that our procedure can be used effectively in Data Integration problems.
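
To make the notion of Query Reverse Engineering concrete, the following is a minimal sketch, not the algorithm developed in the thesis. It assumes a toy in-memory table and restricts the search space to conjunctions of equality predicates; the table contents and the names `ROWS`, `run_query` and `reverse_engineer` are hypothetical and chosen only for illustration. Given a target result, it enumerates candidate queries by brute force and keeps those that reproduce the result on this database instance.

```python
from itertools import combinations

# Hypothetical toy table; attribute names and values are illustrative only.
ROWS = [
    {"id": 1, "name": "Ada", "city": "Milan", "dept": "R&D"},
    {"id": 2, "name": "Bob", "city": "Milan", "dept": "R&D"},
    {"id": 3, "name": "Cleo", "city": "Rome", "dept": "Sales"},
    {"id": 4, "name": "Dan", "city": "Turin", "dept": "Sales"},
]


def run_query(rows, predicates, projection):
    """Evaluate SELECT <projection> WHERE every (attr = value) predicate holds.

    The result is returned as a set of tuples, so duplicates and row order
    are ignored when comparing results.
    """
    return {
        tuple(row[attr] for attr in projection)
        for row in rows
        if all(row[attr] == value for attr, value in predicates)
    }


def reverse_engineer(rows, projection, target, max_predicates=2):
    """Brute-force search for queries whose result equals `target`.

    Candidates are conjunctions of at most `max_predicates` equality
    predicates over the non-projected attributes (an assumption made to
    keep the search space small).
    """
    atoms = sorted(
        {(attr, row[attr]) for row in rows for attr in row if attr not in projection}
    )
    found = []
    for size in range(1, max_predicates + 1):
        for predicates in combinations(atoms, size):
            if run_query(rows, predicates, projection) == target:
                found.append(predicates)
    return found


# Example: which queries return exactly the names Ada and Bob?
target_result = {("Ada",), ("Bob",)}
candidates = reverse_engineer(ROWS, ("name",), target_result)
for predicates in candidates:
    where = " AND ".join(f"{attr} = {value!r}" for attr, value in predicates)
    print(f"SELECT name FROM t WHERE {where}")
```

On this toy table, both `city = 'Milan'` and `dept = 'R&D'` reproduce the target result. The two queries are syntactically different yet indistinguishable on this instance, which is what instance-equivalent means; on a different database instance they could return different results.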
