The design of a data lake architecture for the healthcare use case: problems and solutions

With the growth of data volume and data heterogeneity in the healthcare domain, a need for new types of storage systems has emerged. In particular, different types of data, e.g. lab reports, medical images and genomics data, need to be stored in a central repository in order to create an organized and rich dataset. Since traditional approaches cannot satisfy these requirements, we opted for a Data Lake. A Data Lake is a centralized repository that can store structured, semi-structured, and unstructured data at any scale. However, current Data Lake architectures do not support the specific needs of healthcare data; we therefore proposed an architecture that fits this use case. The objective was to design a system that can efficiently ingest, store and process medical images, such as x-rays, CTs and PETs, as well as the metadata related to these images. Metadata management plays a very important role in the Data Lake architecture, as it allows the value of the data to be fully exploited. To better understand the requirements that the architecture should satisfy, we developed a simple implementation of some steps of the data flow across the Data Lake. The prototype takes as input a series of x-rays and their annotations, performs image preparation and image analysis, and finally stores the results in a relational database. This allowed us to quantify both the number of features that are typically extracted from a single image and the resource usage during these processes. Moreover, we proposed two cloud solutions that can realize the proposed architecture: the first takes advantage of services offered by Amazon, while the second uses services provided by Microsoft. Overall, our work is a step towards implementing a Data Lake that can be used in the healthcare environment, and we believe that, with some extensions, a complete solution can be implemented.

The constant growth in the volume and heterogeneity of healthcare data has created the need for new types of storage. In this sector in particular, different types of data, such as lab reports, genomic data and medical images, need to be collected in a centralized system that makes it possible to build an organized and complete dataset. Since traditional approaches cannot handle this kind of problem appropriately, we opted for a Data Lake. A Data Lake is a centralized repository that can store structured, semi-structured and unstructured data at any scale. The lack of a general architecture for this type of system led us to propose this work, which attempts to provide a solution to the problem. The objective is to design a system that effectively ingests, stores and processes medical images, such as x-rays, PETs and CTs, and that can at the same time manage the metadata associated with these images. This last aspect must be analyzed carefully when designing the Data Lake, because it allows the value of the data to be fully exploited. To better understand the requirements that the architecture must satisfy, we developed an implementation of some parts of the Data Lake. In more detail, the implementation receives as input a series of x-rays with their annotations, runs the preparation steps, analyzes the images and saves the results of this last step in a relational database. This made it possible to quantify both the number of features extracted from a single image and the resource usage during these processes. In addition, the thesis proposes two cloud solutions that realize the presented architecture: the first uses services offered by Amazon, while the second uses services provided by Microsoft.
In summary, this work represents the first step in implementing a Data Lake in the healthcare domain and, with appropriate extensions, will be able to provide an adequate solution.
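
The prototype flow summarized above (x-rays and annotations in, image preparation, image analysis, results into a relational database) could be sketched roughly as follows. This is only an illustrative sketch under stated assumptions, not the implementation developed in this work: the function names, the table layout, the choice of min-max normalization as the preparation step, the four first-order intensity features, the use of SQLite, and the toy 2x2 image with its annotation are all hypothetical stand-ins.

```python
import sqlite3
import statistics


def prepare(pixels):
    """Image preparation (assumed step): min-max normalize grey levels to [0, 1]."""
    flat = [p for row in pixels for p in row]
    lo, hi = min(flat), max(flat)
    span = (hi - lo) or 1  # avoid division by zero on constant images
    return [[(p - lo) / span for p in row] for row in pixels]


def analyse(pixels):
    """Image analysis (assumed step): a few first-order intensity features."""
    flat = [p for row in pixels for p in row]
    return {
        "mean": statistics.fmean(flat),
        "std": statistics.pstdev(flat),
        "min": min(flat),
        "max": max(flat),
    }


def store(conn, image_id, annotation, features):
    """Persist the annotation and one row per extracted feature (hypothetical schema)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS image (id TEXT PRIMARY KEY, annotation TEXT)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS feature ("
        "image_id TEXT REFERENCES image(id), name TEXT, value REAL, "
        "PRIMARY KEY (image_id, name))"
    )
    conn.execute("INSERT INTO image VALUES (?, ?)", (image_id, annotation))
    conn.executemany(
        "INSERT INTO feature VALUES (?, ?, ?)",
        [(image_id, name, value) for name, value in features.items()],
    )
    conn.commit()


def run_pipeline(conn, images):
    """Run preparation, analysis and storage for every (pixels, annotation) pair."""
    for image_id, (pixels, annotation) in images.items():
        store(conn, image_id, annotation, analyse(prepare(pixels)))


# Toy input: a 2x2 "x-ray" with a made-up annotation, stored in an in-memory database.
conn = sqlite3.connect(":memory:")
run_pipeline(conn, {"xray-001": ([[0, 128], [255, 64]], "no finding")})
feature_count = conn.execute("SELECT COUNT(*) FROM feature").fetchone()[0]
print(f"features stored for xray-001: {feature_count}")
```

Storing one row per feature, rather than one column per feature, keeps the hypothetical schema unchanged when the analysis step extracts more features, which makes it easy to count how many features a single image yields.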