Automated Deep Learning segmentation of bone and lytic spinal lesions in CT scans

Osteolytic lesions of the spine are a critical focus in clinical research because of their association with severe pathological conditions, such as metastatic cancers. These lesions are difficult to diagnose because they closely resemble other visually similar findings (such as disc spaces or vertebral hemangiomas) and because their shape, size, and intensity vary across patients. This thesis explores the implementation of Deep Learning (DL) architectures for the segmentation of lytic lesions in CT scans provided by the IRCCS Galeazzi Hospital in Milan. Four architectures were implemented, all differing from those used in the current state of the art; each follows a U-Net shape with various modifications. The development pipeline was organized in two main phases. In the first phase, the models were pre-trained on the public Spine-Mets-CT-SEG dataset to segment the entire vertebral bone. In the second phase, the models were fine-tuned, starting from the previously obtained weights, to segment osteolytic lesions. Finally, a weighted ensemble of the four models was implemented and evaluated with Dice and IoU scores. The best of the four models, evaluated on the test set of the public dataset, achieved a Dice of 0.933 ± 0.098 and an IoU of 0.886 ± 0.118. The ensemble model, tested only on the osteolytic lesion dataset, obtained a Dice of 0.621 ± 0.258 and an IoU of 0.501 ± 0.246. The ensemble was subsequently evaluated on an additional control set, achieving a Dice of 0.693 ± 0.174 and an IoU of 0.556 ± 0.192. The study also validated the segmentation model by examining how it detects true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN) on the complete test set, which includes both healthy and diseased slices. For each of several thresholds, the algorithm labels a slice as positive or negative by comparing its average Dice score, computed with four neighboring slices, against the threshold: if the score is below the threshold, the slice is negative; otherwise, it is positive. Because the threshold was chosen on the test set, this choice is not usable in real-world scenarios; it is only a post-hoc, exploratory step. To facilitate clinical use, a prototype tool with an intuitive graphical user interface was developed. This study shows how DL models can automate clinical processes and could be applied to other medical scenarios.
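
As an illustration of the evaluation metrics and the weighted ensemble mentioned above, the following is a minimal sketch, not the thesis implementation. It assumes each model outputs a per-pixel lesion probability, that the ensemble is a weighted average of those probabilities thresholded at 0.5, and that masks are flattened to lists of 0/1 values. The function names, the example weights, and the convention that two empty masks score 1.0 are all assumptions introduced here for illustration.

```python
from typing import List, Sequence


def dice_score(pred: Sequence[int], target: Sequence[int]) -> float:
    """Dice = 2|A ∩ B| / (|A| + |B|) for flat binary masks (values 0 or 1)."""
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Assumed convention: two empty masks count as a perfect match.
    return 2.0 * intersection / total if total else 1.0


def iou_score(pred: Sequence[int], target: Sequence[int]) -> float:
    """IoU = |A ∩ B| / |A ∪ B| for flat binary masks (values 0 or 1)."""
    intersection = sum(p * t for p, t in zip(pred, target))
    union = sum(pred) + sum(target) - intersection
    # Assumed convention: two empty masks count as a perfect match.
    return intersection / union if union else 1.0


def weighted_ensemble(
    prob_maps: Sequence[Sequence[float]],
    weights: Sequence[float],
    threshold: float = 0.5,
) -> List[int]:
    """Fuse per-model probability maps by weighted average, then binarize."""
    total_weight = sum(weights)
    n_pixels = len(prob_maps[0])
    fused = [
        sum(w * pm[i] for w, pm in zip(weights, prob_maps)) / total_weight
        for i in range(n_pixels)
    ]
    return [1 if p >= threshold else 0 for p in fused]


# Toy example: four hypothetical models, one 6-pixel slice.
target = [0, 1, 1, 1, 0, 0]
prob_maps = [
    [0.1, 0.9, 0.8, 0.4, 0.2, 0.1],
    [0.2, 0.8, 0.7, 0.6, 0.6, 0.0],
    [0.0, 0.7, 0.9, 0.3, 0.1, 0.2],
    [0.3, 0.6, 0.6, 0.7, 0.7, 0.1],
]
weights = [0.4, 0.3, 0.2, 0.1]  # hypothetical weights, not from the thesis

ensemble_mask = weighted_ensemble(prob_maps, weights)
print("mask:", ensemble_mask)
print("Dice:", dice_score(ensemble_mask, target))
print("IoU: ", iou_score(ensemble_mask, target))
```

In practice the weights would be chosen on a validation set, for example in proportion to each model's validation Dice; the abstract does not state how they were set.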

Osteolytic lesions of the spine are a critical focus in clinical research because of their association with severe pathological conditions, such as metastases. This type of lesion poses many diagnostic challenges because its appearance closely resembles other visually similar findings (disc spaces or vertebral hemangiomas) and because its characteristics vary between patients. This thesis explores the implementation of Deep Learning (DL) architectures for the segmentation of lytic lesions in CT scans provided by the IRCCS Galeazzi Hospital in Milan. In this study, four architectures were implemented that differ from those used in the current state of the art. All follow a U-Net structure, with various modifications. The development process was organized in two main phases. First, the models were pre-trained on the public Spine-Mets-CT-SEG dataset to segment the entire vertebral bones. Then, the models were fine-tuned, starting from the previously obtained weights, to segment exclusively the osteolytic lesions, using only the slices that contain lesions. Finally, a weighted ensemble of the four models was implemented and evaluated with Dice and IoU scores. The best of the four models, evaluated on the test set of the public dataset, achieved a Dice of 0.933 ± 0.098 and an IoU of 0.886 ± 0.118. The ensemble model, tested only on the osteolytic lesion dataset, obtained a Dice of 0.621 ± 0.258 and an IoU of 0.501 ± 0.246. The ensemble was subsequently evaluated on another control set, achieving a Dice of 0.693 ± 0.174 and an IoU of 0.556 ± 0.192. The study also validated the segmentation model by examining how it detects TP, TN, FP, and FN on the entire test set (including both healthy and diseased slices): for multiple thresholds, the algorithm labels slices as positive or negative by comparing their average Dice score, computed over the four neighboring slices, against the threshold. If the score is below the threshold, the slice is negative; otherwise, it is positive. Because the threshold was chosen on the test set, this choice is not usable in real-world scenarios; it is only a post-hoc, exploratory step. To facilitate clinical use, a prototype tool with an intuitive graphical user interface was developed. This study demonstrates how DL models can automate clinical processes and could be applied to other medical scenarios.
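
The slice-level labeling step described above can be sketched as follows. This is one possible reading, not the thesis implementation: it assumes the score of a slice is the mean Dice between its predicted mask and the predicted masks of up to four neighbors (two before, two after), that two empty masks score 0.0 so slices without any predicted lesion fall below the threshold, and that a ground-truth flag per slice says whether it contains a lesion. All names and the toy data are hypothetical.

```python
from typing import Dict, List, Sequence


def dice(a: Sequence[int], b: Sequence[int]) -> float:
    """Dice overlap of two flat binary masks; 0.0 when both are empty (assumption)."""
    intersection = sum(x * y for x, y in zip(a, b))
    total = sum(a) + sum(b)
    return 2.0 * intersection / total if total else 0.0


def neighbor_scores(
    pred_masks: Sequence[Sequence[int]],
    offsets: Sequence[int] = (-2, -1, 1, 2),
) -> List[float]:
    """Mean Dice between each slice's prediction and its available neighbors."""
    n = len(pred_masks)
    scores = []
    for i in range(n):
        # Neighbors outside the volume are skipped, so edge slices use fewer.
        neighbors = [i + o for o in offsets if 0 <= i + o < n]
        values = [dice(pred_masks[i], pred_masks[j]) for j in neighbors]
        scores.append(sum(values) / len(values) if values else 0.0)
    return scores


def confusion_counts(
    scores: Sequence[float],
    has_lesion: Sequence[bool],
    threshold: float,
) -> Dict[str, int]:
    """Label a slice positive when score >= threshold, then count TP/TN/FP/FN."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for score, truth in zip(scores, has_lesion):
        predicted_positive = score >= threshold
        if predicted_positive and truth:
            counts["TP"] += 1
        elif predicted_positive and not truth:
            counts["FP"] += 1
        elif truth:
            counts["FN"] += 1
        else:
            counts["TN"] += 1
    return counts


# Toy volume: six slices of four pixels each (hypothetical predictions).
pred_masks = [
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 1],  # isolated prediction with no support from its neighbors
]
has_lesion = [False, False, True, True, True, False]

scores = neighbor_scores(pred_masks)
for threshold in (0.3, 0.5):
    print(threshold, confusion_counts(scores, has_lesion, threshold))
```

Sweeping the threshold on a held-out validation set rather than on the test set would avoid the selection bias that the abstract itself acknowledges.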