Generation of synthetic data from digital health records

The development of technologies such as cloud computing, IoT, and social networks has driven rapid growth in the amount of data generated daily. This rapid and unstoppable growth gave birth to the term Big Data. Big Data has also arrived in the healthcare field, with the introduction of new tools producing massive amounts of structured and unstructured data. For this reason, medical institutions are moving towards data-driven healthcare, with the objective of leveraging this data to support clinical decision-making through suitable information systems. This has come with the need to evaluate the performance of these systems. One of the techniques commonly used for the performance assessment of such systems is modeling, which consists of performing the evaluation on a model of the system under analysis, without requiring the implementation of that system to be complete. However, an adequate performance assessment of Big Data-centered systems requires a variety of data volumes and arrival speeds that, due to the sensitivity of healthcare data, is not available. While in other fields this problem is usually solved through the use of synthetic data generators, in the field of healthcare these are few and not specialized in performance evaluation. For this reason, in this work we focused on the creation of a synthetic data generator for the evaluation of the performance of a Big Data system model. The dataset used as a reference for the creation of the generator is MIMIC-III, which contains the digital health records of thousands of patients collected over a time span of multiple years. As a first step, an analysis of the dataset was performed, in which multiple distribution fitting techniques (e.g., phase-type fitting) were adopted to model the temporal distribution of its data. After that, we developed a generator structured as a multi-module library to allow the customization of each component.
Finally, we tested our generator by evaluating the performance of a simple model of a Big Data system in different scenarios. Through these experiments, we showed the granular control that the generator offers over the synthetic data produced, and the ease with which it can be adapted to different uses.

The development of technologies such as cloud computing, IoT, and social networks has caused the amount of data generated daily to grow at an incredible rate, giving rise to the Big Data phenomenon. Big Data has also reached the healthcare field with new tools for the continuous monitoring of patients. For this reason, medical institutions are moving towards data-driven healthcare in order to support clinical decision-making through suitable information systems. This has brought the need to evaluate the performance of such systems. One of the techniques used for performance evaluation is modeling, which consists of performing the evaluation on a model of the system under analysis, without requiring the implementation of that system to be complete. However, an adequate evaluation of such systems requires diversity in the volume and arrival speed of the data, which is not available because of the sensitivity of the data handled in the healthcare domain. While in other sectors this problem is usually solved through synthetic data generators, in healthcare these are few and not specialized in performance evaluation. For this reason, in this work we focused on creating a synthetic data generator for evaluating the performance of a model of a Big Data system. The dataset used as a reference for creating the generator is MIMIC-III, which contains the digital health records of thousands of patients collected over a span of several years. As a first step, an analysis of the dataset was carried out, in which several distribution fitting techniques (for example, phase-type fitting) were used to model the temporal distribution of the data.
Next, we developed the generator, structuring it as a multi-module library to allow each component to be customized. Finally, we tested our generator by evaluating the performance of a simple model of a Big Data system in different scenarios. Through these experiments, we demonstrated the granular control that the generator offers over the synthetic data produced and the ease with which it can be adapted to different uses.
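
To make the idea of generating arrivals from a fitted temporal distribution more concrete, the following is a minimal Python sketch. It is not the generator developed in this work. It assumes that inter-arrival times follow a two-phase hyperexponential distribution, which is one of the simplest phase-type distributions. The function names, the `speedup` parameter, and the numeric values of `probs` and `rates` are hypothetical and chosen only for illustration; they are not fitted to MIMIC-III.

```python
import random


def sample_hyperexp(rng, probs, rates):
    """Draw one inter-arrival time from a hyperexponential distribution.

    With probability probs[i] the sample is exponential with rate rates[i].
    This is a simple special case of a phase-type distribution.
    """
    u = rng.random()
    cumulative = 0.0
    for p, lam in zip(probs, rates):
        cumulative += p
        if u < cumulative:
            return rng.expovariate(lam)
    # Guard against floating-point rounding in the cumulative sum.
    return rng.expovariate(rates[-1])


def generate_arrivals(n, probs, rates, speedup=1.0, seed=0):
    """Return n increasing arrival timestamps.

    speedup > 1 compresses the inter-arrival times, which is one
    hypothetical way to vary the arrival speed seen by the system model.
    """
    rng = random.Random(seed)
    t = 0.0
    arrivals = []
    for _ in range(n):
        t += sample_hyperexp(rng, probs, rates) / speedup
        arrivals.append(t)
    return arrivals


def hyperexp_mean(probs, rates):
    """Theoretical mean inter-arrival time of the hyperexponential."""
    return sum(p / lam for p, lam in zip(probs, rates))


if __name__ == "__main__":
    # Illustrative parameters only (not fitted to any real dataset).
    probs, rates = [0.7, 0.3], [2.0, 0.2]
    base = generate_arrivals(5, probs, rates)
    fast = generate_arrivals(5, probs, rates, speedup=2.0)
    print("baseline:", base)
    print("2x speed:", fast)
```

Because both runs use the same seed, the `speedup=2.0` run produces the same arrival pattern compressed in time, which illustrates how a single parameter could control the data speed offered to a performance model while preserving the shape of the distribution.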