A preliminary study on using query reverse engineering to perform schema alignment

With the explosion of Web-based data sources, users are directly involved in the Data Integration process because they need a way of linking all the available information. Our work focuses on the Schema Alignment phase of Data Integration, whose goal is to find semantic correspondences between attributes of different schemas. As data have taken on a central role in recent years, one open challenge is the need for a lightweight approach to Schema Alignment. Existing modern schema-mapping solutions require users to provide exemplar tuples, which can lead to ambiguities and inconsistencies in the produced mappings. Our goal is to present a methodology that performs schema mapping automatically, without user influence. We present an instance-based approach that leverages a Query Reverse Engineering (QRE) algorithm to find the mappings between attributes of different schemas. First, in order to work with both numerical and categorical values, the QRE algorithm applies a One-Hot Encoder to the categorical attributes of both the input database and the example table. Since we focus on single-relation databases, the generation of Instance-Equivalent Queries (IEQs) is implemented using Decision Trees with different goodness criteria for the splitting phase. Finally, the Instance-Equivalent Queries are written by translating any encoded attributes back to their original form. Given an example table Q(D), the main idea of our solution is to apply the QRE algorithm to the source database and to every possible match for Q(D) in the target. The mapping can then be generated by comparing the set of IEQs obtained from the source with the different sets of IEQs obtained from the target, choosing those with the best match.
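The QRE step described above (one-hot encoding, decision-tree splitting, decoding of the predicates) can be illustrated with a minimal pure-Python sketch. It is not the implementation behind this work: the toy relation `D`, the helper names (`one_hot`, `best_split`, `positive_paths`, `to_sql`), and the choice to label each row by its membership in the example table are all assumptions made for illustration. Gini and entropy stand in for the "different goodness criteria".

```python
import math
from collections import Counter

def one_hot(rows, categorical):
    """Expand each categorical attribute into 0/1 columns named 'attr=value'."""
    values = {a: sorted({r[a] for r in rows}) for a in sorted(categorical)}
    encoded = []
    for r in rows:
        e = {k: v for k, v in r.items() if k not in categorical}
        for a, vs in values.items():
            for v in vs:
                e[f"{a}={v}"] = 1 if r[a] == v else 0
        encoded.append(e)
    return encoded

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_split(rows, labels, impurity):
    """Return the (attribute, threshold) that most reduces impurity, or None."""
    best, best_score = None, impurity(labels)
    for attr in rows[0]:
        vals = sorted({r[attr] for r in rows})
        for lo, hi in zip(vals, vals[1:]):
            t = (lo + hi) / 2
            left = [l for r, l in zip(rows, labels) if r[attr] <= t]
            right = [l for r, l in zip(rows, labels) if r[attr] > t]
            score = (len(left) * impurity(left)
                     + len(right) * impurity(right)) / len(labels)
            if score < best_score:
                best, best_score = (attr, t), score
    return best

def positive_paths(rows, labels, impurity=gini, path=()):
    """Yield predicate conjunctions for leaves containing only example rows."""
    if all(labels):
        yield path
        return
    split = best_split(rows, labels, impurity)
    if split is None:  # all-negative or unsplittable leaf: no query from here
        return
    attr, t = split
    for keep, op in ((lambda v: v <= t, "<="), (lambda v: v > t, ">")):
        sub = [(r, l) for r, l in zip(rows, labels) if keep(r[attr])]
        yield from positive_paths([r for r, _ in sub], [l for _, l in sub],
                                  impurity, path + ((attr, op, t),))

def to_sql(path, categorical, table="D"):
    """Translate encoded predicates back to the original attributes."""
    conds = []
    for attr, op, t in path:
        name, _, value = attr.partition("=")
        if name in categorical and value:
            conds.append(f"{name} {'=' if op == '>' else '<>'} '{value}'")
        else:
            conds.append(f"{attr} {op} {t}")
    where = " WHERE " + " AND ".join(conds) if conds else ""
    return f"SELECT * FROM {table}{where}"

# Hypothetical single-relation database and example table Q(D).
D = [{"age": 25, "city": "Rome"}, {"age": 40, "city": "Milan"},
     {"age": 47, "city": "Rome"}, {"age": 52, "city": "Turin"}]
example = [D[0], D[2]]

encoded = one_hot(D, {"city"})
labels = [row in example for row in D]
queries = [to_sql(p, {"city"}) for p in positive_paths(encoded, labels)]
print(queries)
```

On this toy input the tree needs a single split on the encoded column `city=Rome`, which is decoded back into a predicate on the original `city` attribute. A greedy tree of this kind stops whenever no single split reduces impurity, so it may return no query for example tables that a richer search would still explain.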

With the growth of data sources on the Web, users are directly involved in the Data Integration process because they need to relate all the available information to one another. We focused on the Schema Alignment phase, in which semantic correspondences between the attributes of different schemas are sought. In recent years the role of data has become central, and the need has arisen for a more lightweight approach to Schema Alignment. The most modern existing solutions require interaction with the user in order to exploit example tuples, but this can lead to ambiguities and inconsistencies in the produced mappings. Our goal is to present a methodology that carries out this phase automatically, without user influence. The approach we present is instance-based and leverages a Query Reverse Engineering (QRE) algorithm to find the possible mappings between attributes of different schemas. First, to handle both numerical and categorical values, the algorithm applies a One-Hot Encoder to the attributes that take categorical values, both in the input database and in the example table. Since we focused on databases consisting of a single relation, the generation of Instance-Equivalent Queries (IEQs) is implemented with Decision Trees that use different criteria to assess the quality of the splits. Finally, the Instance-Equivalent Queries are written by translating any encoded attributes back to their original form. Given an example table Q(D), the main idea of our solution is to apply the QRE algorithm to the source database and to every possible match of Q(D) found in the target. The mappings are then generated by comparing the set of IEQs obtained from the source with the different sets of IEQs obtained from the target, choosing those with the best match.
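The final comparison step can be sketched in the same spirit. The text does not say how two sets of IEQs are compared, so everything here is an assumption made purely for illustration: each IEQ is represented as a list of `(attribute, operator, value)` predicates, attribute names are stripped so that queries over different schemas become comparable, and Jaccard similarity is used as the matching score. The names `signature`, `jaccard`, and `best_candidates`, and the sample data, are hypothetical.

```python
def signature(ieqs):
    """Drop attribute names, keeping only (operator, value) pairs per query."""
    return {frozenset((op, value) for _attr, op, value in q) for q in ieqs}

def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def best_candidates(source_ieqs, target_ieqs_by_candidate):
    """Score every candidate match in the target and return the top ones."""
    src = signature(source_ieqs)
    scores = {cand: jaccard(src, signature(ieqs))
              for cand, ieqs in target_ieqs_by_candidate.items()}
    top = max(scores.values())
    return [cand for cand, s in scores.items() if s == top], scores

# IEQs obtained from the source for the example table Q(D) (hypothetical).
source_ieqs = [[("city", "=", "Rome")], [("zip", "=", "00100")]]

# IEQs obtained from the target, one set per candidate match of Q(D).
target_ieqs = {
    ("full_name",): [[("town", "=", "Rome")], [("postcode", "=", "00100")]],
    ("nickname",): [[("town", "=", "Milan")]],
}

best, scores = best_candidates(source_ieqs, target_ieqs)
print(best, scores)
```

Here the candidate `("full_name",)` is selected because its IEQs have the same predicate structure as those of the source. A real comparison would likely also need to tolerate differences in value formats and thresholds between the two schemas, which this sketch ignores.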