A Data Lake Prototype for Healthcare

With the growth of the Internet of Things and the rapid progress of social networks, almost everything now appears to generate data. The ever-increasing number of connected devices is accompanied by an ever-increasing volume of data produced at an ever-increasing rate. This massive flow includes data types that are difficult to process with standard databases or standard techniques. One of the domains that produces a large amount of hard-to-manage data is healthcare. In this context there is a need to store and manage different types of data, e.g., reports written in natural language, medical images, genomics data, and waveforms of vital signs, which do not have a well-defined structure. To exploit this complex data, Data Lakes have recently emerged as a concept that ensures central storage and flexible analysis for all types of data. However, there is currently no single Data Lake architecture that fits all scenarios. The architecture depends heavily on the domain in which it operates and, so far, no Data Lake architecture supports the specific needs of the healthcare domain. This thesis proposes a Data Lake architecture that effectively performs data ingestion, data storage, and data access, with the aim of providing a single central repository and efficient analysis of the different types of data in the healthcare domain. The architecture also enables analysis and querying of the data, which can be loaded into the Data Lake regardless of format and type. To verify the effectiveness of the architecture, a prototype of the designed architecture was developed.
This prototype supports the ingestion of various data, processes waveforms to make them more interpretable to researchers and analysts, grants access to any stored data, and analyzes reports written in natural language by using machine learning techniques for keyword extraction. In addition, the prototype underwent a performance evaluation in each of its main phases: ingestion, processing, data access, and data analysis. The results yielded several considerations to take into account when configuring and using the system components.

With the growth of the Internet of Things and the rapid progress of today's social networks, any device is now able to generate data. The ever-growing number of connected devices is accompanied by an ever-larger volume of data generated at very high speed. This enormous flow of data includes data types that are complex to process with standard databases or techniques. One of the sectors that generates a large amount of such data is healthcare. In particular, in this context there is a need to store and manage different types of data, such as reports written in natural language, medical images, genomic data, and vital-sign waveforms, which do not have a well-defined structure. To exploit the information derived from them, Data Lakes have recently emerged as a concept for ensuring central storage and flexible analysis of every type of data. Currently there is no single Data Lake architecture that suits every scenario. Its design depends heavily on the domain in which it operates and, in fact, there are no Data Lake architectures that support the specific needs of the healthcare sector. This thesis proposes a Data Lake architecture that effectively performs data loading, storage, and access, with the aim of providing a single central repository for healthcare data. The architecture also allows analysis and querying of the data, which can be loaded into the Data Lake regardless of format and type. To verify the effectiveness of the architecture, a prototype of the designed system was developed.
The prototype supports the loading and storage of different types of data, processes waveforms to make them more interpretable by researchers and analysts, guarantees access to any saved data, and provides analysis of reports written in natural language using machine learning techniques for keyword extraction. In addition, the prototype underwent a performance evaluation study in each of its main working phases: data loading, data processing, data access, and data analysis. The study produced results to be considered when configuring and using the system components.