Forecast of renal failure for patients affected by chronic kidney disease with machine learning techniques on hospital data

In recent years, machine learning (ML) techniques have achieved considerable success in medical diagnosis, and the number of diagnostic questions for which algorithms are designed has grown rapidly. This growth has been helped by the digital revolution of the last few years, which made it possible to record and store many types of clinical data in an easily available and usable digital format. One clinical area attracting significant interest is chronic disease: examples include the prediction and early detection of diabetes and its complications, chronic obstructive pulmonary disease, heart disease and cancer. ML techniques have been shown to allow better management of chronic and fragile patients. Besides being better suited to a preventive, rather than curative, model of care, analyses of this type allow more accurate classification of the patient’s clinical status and an optimization of the resources employed by the hospital. Worth mentioning in this context is the innovative organizational method adopted by the Lombardy Region for taking charge of chronic and fragile patients. Specifically, the project called Individualized Care Plan (ICP) aims at maximizing the personalization of patient care and at improving the quality of diagnosis, treatment, assistance and rehabilitation. This thesis, carried out in collaboration with the Vimercate Hospital, one of the most advanced hospitals in Lombardy and Italy in terms of information technology (IT) and digital infrastructure, focuses on the evolution of Chronic Kidney Disease (CKD). CKD is a condition for which there is no cure. It affects around 30 million Americans (approximately 1 in 7 adults) and 2.2 million Italians (7% of the population), with annual care costs of over 32 billion dollars in the USA and 2-2.5 billion euros in Italy, close to 2% of total health expenditure.
The goal of this thesis is to define and implement a computational model based on machine learning algorithms, trained on the data available in the information assets of the Vimercate Hospital, that can estimate the time frame within which dialysis treatment may be required for patients suffering from severe Chronic Kidney Disease. The study was supervised by the Vimercate Hospital’s nephrologists, who contributed to all its phases, from the initial formulation of the diagnostic problem to the discussion of the results to verify their consistency. The work involves building a dataset that contains, in a structured and usable form, all the potentially useful information on which the ML models are trained. This information is extracted from the data collected during the last 13 years by the Vimercate Hospital thanks to the adoption of electronic medical records in early 2000, and includes structured and unstructured data coming from different databases. As input features for the models we considered patient characteristics such as age, sex, associated pathologies or comorbidities, family history and blood test results, initially amounting to more than 60 input features. The target variable to be predicted is defined as the number of months elapsed from a clinical evaluation of the patient’s kidney function to the beginning of dialysis treatment. The final dataset is composed of more than 4,000 rows, or instances, referring to 906 distinct patients, each of whom has several clinical evaluations over the period of observation. Each instance represents the clinical situation of one patient at one evaluation (sex, current age and renal function, clinical history, recent and past blood tests, ...) and includes the target variable together with the input features presumed useful for describing that situation and predicting when the patient will need to be dialyzed.

To predict the target variable, we considered different approaches and different algorithms. We compared two formulations of the problem. In the classification formulation, we define relevant month intervals (for example 0-6 months, 7-18 months, more than 18 months) and let the classifier determine to which class the patient’s current clinical situation belongs. In the regression formulation, we let the model predict the number of months until dialysis treatment and then discretize the output into the same intervals used for classification. For both approaches we tried several algorithms, including Neural Networks, Decision Trees and Support Vector Machines. The classification approach works better than the regression one, independently of the algorithm used, and the algorithms that obtain the best performance, for both classification and regression, are Decision Trees. We therefore explored the classification approach further with Decision Trees and ensemble methods. Based on the results of the best performing models and on a preliminary correlation analysis, we performed an a posteriori feature elimination that halved the number of features used in the training phase without loss of performance. Our best model, an ensemble of Decision Trees, reaches an overall test accuracy of up to 94% using only 27 features. The results of the feature importance analysis are in line with the literature and with the doctors’ knowledge.
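As an illustration of the target variable and of the month intervals described above, the following is a minimal sketch using only the Python standard library. The function names, the whole-month counting rule and the treatment of interval boundaries as inclusive upper bounds are assumptions made for this example; the text does not specify how the thesis implements these steps.

```python
from bisect import bisect_left
from datetime import date


def months_until_dialysis(evaluation: date, dialysis_start: date) -> int:
    """Whole months elapsed from a clinical evaluation to the start of dialysis.

    Assumption: a month counts only once the same day of the month is reached.
    """
    months = (dialysis_start.year - evaluation.year) * 12
    months += dialysis_start.month - evaluation.month
    if dialysis_start.day < evaluation.day:
        months -= 1
    return months


def to_interval_class(months: float, upper_bounds=(6, 18)) -> int:
    """Map a number of months to a class index.

    With the default bounds: 0 -> up to 6 months, 1 -> up to 18 months,
    2 -> more than 18 months. Boundaries are inclusive upper bounds (assumed).
    """
    return bisect_left(upper_bounds, months)


# Classification formulation: the label is the interval of the true target.
target = months_until_dialysis(date(2015, 1, 15), date(2015, 7, 15))
label = to_interval_class(target)

# Regression formulation: the model predicts months, which are then
# discretized with the same function (6.4 is a made-up prediction).
predicted_label = to_interval_class(6.4)
```

Using the same binning function for both formulations keeps the two approaches comparable, since the predicted and the true classes are defined on identical intervals.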
We therefore propose a set of Extremely Randomized Trees (ExtraTrees for short) classifiers that predict the occurrence of renal failure within predefined month intervals, with an overall accuracy that depends on the required granularity of those intervals. The first ExtraTrees model is a binary classifier that predicts whether or not the patient will need to be dialyzed within 1 year; it obtains an accuracy of 94% on both the validation set and the test set, with an Area Under the Curve (AUC) of 0.98 computed on the test set. Alternatively, we propose an ExtraTrees classifier that predicts whether renal failure will occur within the first 6 months, between 6 and 18 months, or after 18 months (3 classes in total), with a validation accuracy of 90% and a test accuracy of 91%. Finally, we propose a 4-class ExtraTrees model that predicts whether renal failure will occur within the first 6 months, between 6 and 14 months, between 14 and 24 months, or after more than 24 months, with a validation accuracy of 88% and a test accuracy of 87%. Given the unpredictable nature of CKD’s evolution, these results are promising, and the doctors have shown great interest in the approach. We intend to test the developed models further in order to integrate them into the decision support system of the hospital. The computational approach, coupled with the doctors’ knowledge and experience, is expected to allow greater accuracy in predicting the patient’s clinical pathway, supporting strategic planning and care that is better personalized to the patient’s needs.
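For readers less familiar with the two metrics quoted above, the sketch below shows how accuracy and the Area Under the Curve can be computed for a binary "dialysis within 1 year" classifier. It uses only the Python standard library, and the labels and scores are made-up values for illustration, not data or results from the thesis. In practice the scores would be the class probabilities produced by a trained classifier, for example the `predict_proba` output of scikit-learn's `ExtraTreesClassifier`; whether the thesis used that library is not stated in the text.

```python
def accuracy(y_true, y_pred):
    """Fraction of instances whose predicted class equals the true class."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def roc_auc(y_true, scores):
    """Area under the ROC curve for binary labels (1 = positive, 0 = negative).

    Computed as the probability that a randomly chosen positive instance
    receives a higher score than a randomly chosen negative one, with ties
    counted as one half.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical example: 1 = dialysis needed within 1 year, 0 = not needed.
y_true = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.3, 0.6, 0.4, 0.2]        # made-up predicted probabilities
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # 0.5 threshold is an assumption

acc = accuracy(y_true, y_pred)
auc = roc_auc(y_true, scores)
```

Accuracy depends on the chosen decision threshold, whereas the AUC summarizes how well the scores rank positive instances above negative ones across all thresholds, which is why the two figures are reported together.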
Doctors will therefore be able to plan upcoming clinical visits more effectively. If there is evidence of a rapid deterioration in renal function, visits can be scheduled at short intervals, so that more attention is paid to patients at risk; conversely, if the patient is not at risk, the next clinical check can be postponed to the following months, optimizing both the resources used by the hospital (in terms of staff, department crowding, exam prescriptions, ...) and the time and energy of the patient undergoing the visits. The beginning of dialysis treatment itself can be planned in advance with greater precision, allowing both doctors and patients to organize themselves appropriately. It will also be possible to obtain more detail on the progression of the disease in the individual patient, by analyzing how the predictions of the models change from one clinical observation to the next in relation to the administered treatment, the lifestyle or the diet of the patient. Moreover, the study helps clarify which clinical and physiological characteristics of the patient most strongly determine the speed at which CKD advances. In fact, the study shows that several factors influence the rate at which the disease progresses, and that the estimated glomerular filtration rate, the main parameter used in clinical practice to stratify patients by their risk of complete renal failure in the short term, is not sufficient on its own to estimate in advance when the patient will need to be dialyzed.
This is probably because the glomerular filtration rate is itself estimated by a mathematical formula from the clinical literature which, in addition to the level of creatinine in the blood, tries to account for the patient’s gender and ethnicity through empirically tested multiplying coefficients. When computational models of this type instead receive these and other patient-related variables as input, they can estimate more accurately the relationships between the variables and the rate at which the disease progresses. Examples of information that proved important for the prediction are, in addition to the patient’s age and comorbidities, the recent values and trends of creatinine, urea, red blood cell count, specific gravity and the estimated glomerular filtration rate itself. Surprisingly, the estimated glomerular filtration rate appears to be neither the only factor nor the most important one for accurately predicting when the patient will need to be dialyzed.
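To make concrete what a formula of the kind mentioned above looks like, the sketch below implements the 4-variable MDRD study equation (in its IDMS-traceable form), one widely used formula for the estimated glomerular filtration rate. This choice is an assumption made for illustration: the text does not name the formula adopted at the Vimercate Hospital. Note that MDRD also uses the patient's age, in addition to the creatinine level and the gender and ethnicity coefficients mentioned in the text. The function and parameter names are hypothetical.

```python
def egfr_mdrd(creatinine_mg_dl: float, age_years: float,
              female: bool, black: bool) -> float:
    """Estimated GFR in mL/min/1.73 m^2 from the 4-variable MDRD equation.

    eGFR = 175 * Scr^-1.154 * age^-0.203 * 0.742 (if female) * 1.212 (if Black)
    where Scr is serum creatinine in mg/dL.
    """
    if creatinine_mg_dl <= 0 or age_years <= 0:
        raise ValueError("creatinine and age must be positive")
    egfr = 175.0 * creatinine_mg_dl ** -1.154 * age_years ** -0.203
    if female:
        egfr *= 0.742  # multiplying coefficient for gender
    if black:
        egfr *= 1.212  # multiplying coefficient for ethnicity
    return egfr


# Same creatinine and age, different gender coefficient (illustrative values).
male_estimate = egfr_mdrd(2.5, 65, female=False, black=False)
female_estimate = egfr_mdrd(2.5, 65, female=True, black=False)
```

Because gender and ethnicity enter only as fixed multipliers, two patients with identical creatinine and age always differ by the same constant factor. A learned model that receives the underlying variables directly is not bound to such a fixed relationship, which is consistent with the observation in the text.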
