Schema Query Reverse Engineering
Censuales, Simone; Catalfamo, Walter
2021/2022
Abstract
Web-based information has grown exponentially in recent years, as our lives become increasingly intertwined with technology. The amount of data produced is now so large that a new term was coined for it: Big Data. Consequently, the need to store such data has grown as well, creating demand for large, well-organized containers called databases. Each database is built on a precise schema. An obvious problem then arises: how do we merge two schemas that represent the same context but differ slightly from each other? Moreover, data is constantly evolving, and so are the schemas needed to store it. This has led to the development of data integration systems. The relationships between the attributes of different schemas are discovered through schema mapping. Many current frameworks for creating a schema mapping are designed for professional users who are familiar with how databases operate and how to establish relationships between schemas. However, data integration systems are increasingly adopted by non-technical users. The aim of our work is to provide them with a more user-friendly approach to data integration. We developed an algorithm, called Schema Query Reverse Engineering, that takes two datasets as input and, starting from an attribute name and a value selected by the user, generates a set of queries, verifying that these queries return the same results when applied to the datasets we want to match. The final result is a mapping between the schemas of the two datasets, represented by an intuitive similarity matrix that describes the correlation between each component of the two databases and is readable even by a non-technical user.
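To illustrate the idea, the sketch below is a minimal, hypothetical example of how reverse-engineered selection queries could be used to score attribute correspondences; it is not the implementation from the thesis, and all names (select, similarity_matrix, the toy datasets) are assumptions made for illustration. For each user-chosen attribute/value pair in one dataset, the same value is queried against every attribute of the other dataset, and candidate attributes whose queries return similarly sized result sets receive a higher score in the similarity matrix.

# Illustrative sketch of the query-reverse-engineering idea (assumed names,
# not the thesis implementation): for a user-selected attribute/value pair
# in dataset A, run the equivalent selection query over every attribute of
# dataset B and score how closely the result sets agree.

from typing import Any, Dict, List


def select(rows: List[Dict[str, Any]], attr: str, value: Any) -> set:
    """Run a simple 'SELECT * WHERE attr = value' query; return matching row ids."""
    return {i for i, row in enumerate(rows) if row.get(attr) == value}


def similarity_matrix(ds_a: List[Dict[str, Any]],
                      ds_b: List[Dict[str, Any]],
                      probes: List[tuple]) -> Dict[str, Dict[str, float]]:
    """For each probe (attribute of A, value) chosen by the user, compare the
    result set in A with the result set obtained by querying each attribute
    of B with the same value; similar result sets raise the score."""
    attrs_b = sorted({k for row in ds_b for k in row})
    matrix = {a: {b: 0.0 for b in attrs_b} for a, _ in probes}
    for attr_a, value in probes:
        hits_a = select(ds_a, attr_a, value)
        for attr_b in attrs_b:
            hits_b = select(ds_b, attr_b, value)
            biggest = max(len(hits_a), len(hits_b))
            if biggest:
                # Score 1.0 when both queries return the same number of rows,
                # lower when the result-set sizes diverge.
                matrix[attr_a][attr_b] = min(len(hits_a), len(hits_b)) / biggest
    return matrix


if __name__ == "__main__":
    # Two small datasets describing the same context with different schemas.
    people_a = [{"name": "Ada", "city": "Milan"}, {"name": "Bob", "city": "Rome"}]
    people_b = [{"full_name": "Ada", "town": "Milan"}, {"full_name": "Bob", "town": "Rome"}]
    print(similarity_matrix(people_a, people_b, [("name", "Ada"), ("city", "Milan")]))

In this toy run, "name" scores 1.0 against "full_name" and "city" scores 1.0 against "town", which is the kind of attribute correspondence the similarity matrix is meant to surface for a non-technical user.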
https://hdl.handle.net/10589/192124