A flexible approach to data quality assessment for big data sources

Collecting data has become easier thanks to inexpensive, innovative machines and sensors that are connected to computers and to each other through fast, advanced communication infrastructures. The resulting volume of available and analysable data is making the world smarter. However, such data create real value only when their quality is adequate: data can be noisy, wrong, or incomplete whenever an error of any kind occurs, whereas good decisions and actions depend on correct, reliable, and complete data. Proper use of sufficiently high-quality data can yield quantitative measurements that help improve the operational efficiency of business and industrial processes. New algorithms have to be designed to cope with the novel requirements raised by the variety, volume, and velocity of Big Data. In particular, dealing with heterogeneous sources requires an adaptive approach that triggers the suitable quality assessment methods on the basis of the data types and of the context in which the data are to be used. In this thesis, a data quality service has been designed and developed to make users (humans or applications) aware of the quality of Big Data sources with respect to their goals. Taking data from a smart city as a case study, the thesis proposes a new general module that analyses Big Data sources to derive multiple quality indicators based on a set of dimensions. The module has been implemented using Apache Spark. Moreover, the module is flexible, since the portion of data to analyse is selected automatically on the basis of the users' non-functional requirements.
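As an illustration of the idea of deriving a quality indicator from only a portion of the data, the following is a minimal sketch in plain Python. It is not the thesis module and does not use Apache Spark. The completeness indicator is chosen only because the abstract mentions incomplete data, and the names `completeness`, `assess`, `sample_fraction`, and the sensor fields are hypothetical. The sketch assumes that a non-functional requirement, such as a time budget, has already been translated into a sampling fraction.

```python
import random
from typing import Optional, Sequence

# A record maps field names to values; None marks a missing value.
Record = dict[str, Optional[float]]


def completeness(records: Sequence[Record], fields: Sequence[str]) -> float:
    """Return the fraction of expected field values that are present."""
    total = len(records) * len(fields)
    if total == 0:
        # Assumption: an empty input is treated as vacuously complete.
        return 1.0
    present = sum(
        1 for record in records for field in fields
        if record.get(field) is not None
    )
    return present / total


def assess(
    records: Sequence[Record],
    fields: Sequence[str],
    sample_fraction: float,
    seed: int = 0,
) -> dict[str, float]:
    """Estimate completeness on a random portion of the records.

    sample_fraction is a hypothetical stand-in for a user's
    non-functional requirement (for example, a limited time budget).
    """
    if not 0.0 < sample_fraction <= 1.0:
        raise ValueError("sample_fraction must be in (0, 1]")
    # Analyse at least one record whenever any data are available.
    size = max(1, round(len(records) * sample_fraction)) if records else 0
    sample = random.Random(seed).sample(list(records), size)
    return {
        "completeness": completeness(sample, fields),
        "records_analysed": float(size),
    }


if __name__ == "__main__":
    readings = [
        {"temperature": 20.0, "humidity": None},
        {"temperature": None, "humidity": 0.4},
        {"temperature": 21.0, "humidity": 0.5},
        {"temperature": 19.5, "humidity": 0.6},
    ]
    # Full analysis versus analysis of half of the records.
    print(assess(readings, ["temperature", "humidity"], sample_fraction=1.0))
    print(assess(readings, ["temperature", "humidity"], sample_fraction=0.5))
```

A smaller `sample_fraction` reduces the work done at the cost of a less precise estimate, which is the kind of trade-off the abstract attributes to the users' non-functional requirements.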
