Address matching toolkit: a solution to uniquely identify street addresses

This thesis is based on the results of a six-month internship carried out at PriceHubble, an international company that provides AI-driven real estate valuations and insights. It explores different approaches to address matching, that is, the use of a street address as a proxy to uniquely identify a building or a property, and it provides a portable toolkit that standardises and identifies a raw input address. One of the main challenges with addresses is that the same address can be written in many different ways, which complicates any automated processing. Many companies, including giants such as Google, are trying to build tools capable of handling an address correctly, from standardisation to validation. The Address Toolkit presented in this thesis proposes a solution to this problem that also aims to cope with spelling errors. It simplifies the work of the company's engineers on address manipulation, a task that tends to be time-consuming and extremely case-specific. The standardisation part of the toolkit uses Libpostal, a library for parsing international street addresses using statistical NLP and open data. The identification part uses PriceHubble's internal databases to match the queried input address and returns the unique id of the property it refers to. Given its practical, business-oriented nature, this work draws on experience gained through day-to-day tasks, from data pipeline implementation to database augmentation. The results show good precision in both the standardisation and the identification steps. On a sample of around 30M raw addresses, 90% are correctly standardised and, of these, 94% are correctly identified. The toolkit's services have finally been exposed through an API for external users.

This thesis is the result of a six-month internship at PriceHubble, an international company that provides real estate valuations and insights using AI. It explores different types of address matching solutions, namely the use of a street address as a proxy to uniquely identify a building or a property. It therefore provides a portable toolkit that standardises and identifies a street address. One of the main challenges when dealing with an address is that it can be written in many different ways. This leads to many complications for a software engineer. Many companies, including giants such as Google, are trying to build tools able to handle street addresses correctly, from their standardisation to their validation. The Address Toolkit presented in this thesis proposes a solution to this problem and also aims to resolve spelling errors. It simplifies the work of the company's engineers when dealing with addresses, a task that tends to be time-consuming and extremely case-specific. The standardisation part of the toolkit uses Libpostal, a library for parsing international addresses using NLP and open data. The identification part uses PriceHubble's internal databases to match an input address and returns the unique id of the property it refers to. Thanks to its practical, business-oriented nature, this work draws on the experience gained in day-to-day activities, from data pipeline implementation to database augmentation. The results show good precision in both the standardisation and the identification phases. On a sample of around 30 million raw addresses, 90% are correctly standardised and, of these, 94% are correctly identified. The toolkit's services have finally been exposed through an API.
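
To make the two steps named in the abstract concrete, the following is a minimal sketch of a standardise-then-identify flow. It is not the toolkit's implementation: it uses only the Python standard library in place of Libpostal and of PriceHubble's internal databases, and every name in it (`ABBREVIATIONS`, `REFERENCE`, `standardise`, `identify`, the property ids and the 0.85 similarity cutoff) is a hypothetical choice made for illustration.

```python
import difflib
import re
from typing import Optional

# Hypothetical abbreviation table; a real system would rely on a full
# parser such as Libpostal rather than a hand-written dictionary.
ABBREVIATIONS = {"st": "street", "ave": "avenue", "rd": "road"}

# Hypothetical reference table: standardised address -> unique property id.
REFERENCE = {
    "10 downing street london": "prop-001",
    "221b baker street london": "prop-002",
}


def standardise(raw: str) -> str:
    """Lowercase, drop punctuation and expand a few known abbreviations."""
    tokens = re.sub(r"[^\w\s]", " ", raw.lower()).split()
    return " ".join(ABBREVIATIONS.get(token, token) for token in tokens)


def identify(raw: str, cutoff: float = 0.85) -> Optional[str]:
    """Return the property id for a raw address, or None if nothing matches.

    An exact lookup is tried first; otherwise the closest reference address
    above the similarity cutoff is used, which tolerates small typos.
    """
    key = standardise(raw)
    if key in REFERENCE:
        return REFERENCE[key]
    close = difflib.get_close_matches(key, list(REFERENCE), n=1, cutoff=cutoff)
    return REFERENCE[close[0]] if close else None


if __name__ == "__main__":
    print(identify("10 Downing St., London"))    # exact match after standardisation
    print(identify("221B Bakr Street, London"))  # misspelled street name
    print(identify("1 Unknown Place, Paris"))    # no sufficiently close match
```

The exact lookup keeps the common case cheap, while the fuzzy fallback illustrates one simple way to absorb spelling errors; the cutoff trades missed matches against wrong ones and would need tuning on real data.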