Deep-learning-based solutions for the blocking phase in entity linkage

The integration of data from different sources is a relevant task for companies, government agencies, banks and many other actors that need it to carry out important real-world activities: merging customer databases, analyzing population statistics and managing client profiles are just a few examples. Integrating data in these scenarios is relatively simple, especially when the data sources have clean and standard attributes. However, with the increasing role of Internet-based services such as e-commerce, product comparison websites or online libraries, data integration is becoming more and more challenging. These services deal with data that is typically noisy and that contains attribute values written in natural language, such as product descriptions or book reviews. Integrating such data is harder because dirty values are difficult to manage and because semantics are difficult to extract from long textual attributes. In particular, the challenging part in such settings is identifying which records from the several source data sets represent the same concept or the same entity. The task of finding matching records across data sets in a data integration activity is called entity linkage. This thesis explores novel techniques designed to support this linking phase when the data sets to be merged contain attributes written in natural language or with noisy values. In particular, this work focuses on the blocking phase, a key part of entity linkage that is described thoroughly in the thesis. The models that we propose are based on neural networks and on word and sentence embeddings, paradigms that are showing promising results in tasks related to Natural Language Processing. The models are not only presented in their design and implementation but also tested on real data sets to show the improvements they can provide over traditional methods.

Integrating data from multiple sources is a crucial process for companies, government agencies, banks and many other organizations that need it to carry out important real-world activities: merging customer databases, analyzing population statistics and managing client profiles are only a few examples. Integrating data in these scenarios is relatively simple, especially when the sources have standard and clean attribute values. However, with the growing importance of Internet services such as e-commerce, product comparison sites or online libraries, integrating data is becoming increasingly demanding. These services handle data that is typically dirty and that contains attribute values expressed in natural language, for example product descriptions or book reviews. Integrating such kinds of data is more complex because of the difficulty of handling dirty data and of extracting semantic content from text written in natural language. In particular, in this context it is challenging to identify which records from the different data sets represent the same concept or the same entity. This process of identifying corresponding records across multiple data sets is part of the data integration activity and is called entity linkage. This thesis explores innovative techniques designed to support this specific phase when the data sets to be integrated contain attributes that are written in natural language or are noisy. In particular, this work focuses on the blocking phase, which is a key component of entity linkage and is analyzed in depth later. The models we propose are based on neural networks and on numerical representations of words and sentences of natural language, solutions that are showing interesting and promising results in related areas of Natural Language Processing. Our models are not only presented in their architecture and implementation but also tested on real data sets to show the improvements they can provide over more traditional methods.