Graph neural network pipeline for unsupervised clinical document analysis

Most clinical documents exist as unstructured data, that is, as free text. To apply automatic learning methods to their content, the text must first be given a mathematical representation; this is one of the tasks of Natural Language Processing. This thesis analyzes the E3C dataset, which contains 10,473 clinical cases, with the goal of producing clusters of documents that are useful to medical staff. The dataset provides no labels for the documents, so an unsupervised approach was required. Inspired by the recent success of Graph Neural Networks (GNNs) in the related field of supervised text classification, we developed a fully unsupervised modification of the InfoGraph model that creates vector representations of the E3C documents, which are then grouped by similarity using classical clustering methods. Before being passed to the model, each document is represented as a homogeneous graph in which each node represents a word in the text and each edge connects words that appear adjacent in the text. To evaluate the results in the absence of labels, we relied on the UMLS dictionary, on the pre-existing Doc2Vec method, and on labels generated automatically by ChatGPT. This is one of the first fully unsupervised graph neural network works and, to our knowledge, the first to deal with textual documents. The results show that the vector representations of the texts capture information about the origin of the documents but only minimal information about their medical content. Several modifications could be implemented to improve the model's performance, indicating that this field of research offers many opportunities for future development.
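
The document-to-graph step described above can be illustrated with a minimal Python sketch. It is not the thesis implementation: the function name `build_word_graph`, the regex tokenizer, the choice of one node per distinct word, the undirected edges, and the co-occurrence counts used as edge weights are all assumptions made for illustration. For simplicity, adjacency is also allowed to cross sentence boundaries.

```python
import re
from collections import Counter


def build_word_graph(text):
    """Build an undirected word-adjacency graph from raw text.

    Assumptions (not taken from the thesis): lowercase regex tokenization,
    one node per distinct word, and edge weights that count how often two
    words appear next to each other.
    """
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    nodes = sorted(set(tokens))                      # one node per distinct word
    index = {word: i for i, word in enumerate(nodes)}

    edges = Counter()
    for left, right in zip(tokens, tokens[1:]):      # consecutive word pairs
        if left == right:
            continue                                 # skip self-loops
        a, b = sorted((index[left], index[right]))   # undirected: store (low, high)
        edges[(a, b)] += 1
    return nodes, dict(edges)


nodes, edges = build_word_graph("The patient reported chest pain. The pain resolved.")
for (a, b), weight in sorted(edges.items()):
    print(f"{nodes[a]} -- {nodes[b]} (weight {weight})")
```

The resulting node list and edge dictionary can be converted into whatever graph format the chosen GNN library expects, typically an edge-index array plus a node-feature matrix.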
