Complex networks are increasingly used to represent and analyze real-world systems such as social networks, biological processes, and economic markets. Because a network can be built in different ways, different networks capture different features of the same system. Beyond being analyzed separately, these features need to be integrated to give an overall view of the process they represent. In this work, three different networks are built and three methods are considered for combining their characteristics: Principal Component Analysis (PCA), Node2Vec (N2V), and Similarity Network Fusion (SNF). The performance of the three methods is evaluated on biological data, namely expression data of RNA molecules. The methods are used to predict the presence and the stage of a tumor, specifically Kidney Renal Clear Cell Carcinoma (KIRC), from gene and miRNA expression data. The expression data, extracted with the GenoMetric Query Language (GMQL), are first preprocessed so that they are clean (noisy molecules are removed) and homogeneous (gene and miRNA data share the same format). The pipeline then builds the networks, and a first analysis is performed on the degree of each node, i.e., the number of its edges, and on its strength, i.e., the sum of the weights of its edges. Next, the three integration methods are applied and used to select the most relevant RNA molecules as features. These features are passed to a Random Forest classifier, which classifies the samples by condition (Normal or Cancer) or by tumor stage. The features extracted with N2V perform best for the Normal versus Cancer classification and those extracted with SNF perform best for the tumor stage classification; these two sets are therefore analyzed biologically with an enrichment analysis.
The enrichment analysis shows that the extracted molecules are involved in biological processes related to tumor development and diagnosis. It also shows that the sets are associated with kidney tissue as well as with other tissues, which can be the starting point for further analyses of other tumor types.
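The two node-level quantities defined in the abstract, degree and strength, can be illustrated with a minimal sketch. It is not the thesis code: the function name `degrees_and_strengths`, the node names, and the edge weights are all hypothetical, and the network is assumed to be undirected, weighted, and free of self-loops.

```python
from collections import defaultdict


def degrees_and_strengths(edges):
    """Return (degree, strength) dicts for an undirected weighted edge list.

    degree[n]   = number of edges incident to node n
    strength[n] = sum of the weights of those edges
    Assumes no self-loops (a self-loop would be counted twice here).
    """
    degree = defaultdict(int)
    strength = defaultdict(float)
    for u, v, w in edges:
        # Each undirected edge contributes to both of its endpoints.
        for node in (u, v):
            degree[node] += 1
            strength[node] += w
    return dict(degree), dict(strength)


# Toy network with made-up nodes and weights (not data from the thesis).
toy_edges = [
    ("gene_A", "gene_B", 0.5),
    ("gene_A", "mirna_X", 0.25),
    ("gene_B", "mirna_X", 1.0),
    ("gene_A", "gene_C", 0.25),
]

degree, strength = degrees_and_strengths(toy_edges)
for node in sorted(degree):
    print(node, degree[node], strength[node])
```

In this toy example `gene_A` has three edges, so its degree is 3, while its strength is the sum of the three weights. Two nodes with the same degree can therefore differ in strength when their edge weights differ.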
Complex networks are a widely used method for representing and analyzing real-world systems such as social networks, biological processes, and economic markets. Because they can be built in various ways, complex networks represent different characteristics of these systems. Besides analyzing all their properties, it is also possible to integrate them to obtain an overall view of the process they represent. In this thesis, complex networks were built using three different measures, and three different integration methods were then used to combine their characteristics. These three methods are Principal Component Analysis (PCA), Node2Vec (N2V), and Similarity Network Fusion (SNF). The effectiveness of the three proposed methods was validated on biological data from renal carcinoma, namely expression data associated with genes or microRNAs. Starting from the expression data extracted with GMQL (GenoMetric Query Language), a preprocessing phase was necessary: noisy data were removed and the measurements associated with genes and miRNAs were made homogeneous. The networks were then created and a first analysis was carried out on the degree, i.e., the number of edges each node has, and on the strengths, i.e., the sum of the edge weights for each node. Subsequently, the three integration methods were applied to extract molecules that are significant for distinguishing normal from tumor samples and for classifying the tumor stage. The extracted molecules were passed to a Random Forest classifier to classify the sample condition and the tumor stage. The results extracted with N2V gave better performance for the classification between Normal and Tumor samples, while the SNF results were better for the classification of the tumor stage.
These two sets of RNA molecules were then analyzed using a statistical enrichment test. The analysis shows that the extracted RNAs influence biological processes associated with the diagnosis and development of cancer. In addition, the extracted RNA sets are enriched in kidney tissue. The extracted genes are also present in other tissues, which can be explored further by considering other tumor types.
Network integration algorithms for the analysis of biological data
PIDO', SARA
2018/2019
File | Description | Size | Format |
---|---|---|---|
2019_12_Pidò.pdf (authorized users only from 01/12/2022) | Thesis | 6.52 MB | Adobe PDF |
Documents in POLITesi are protected by copyright and all rights are reserved, unless otherwise indicated.
https://hdl.handle.net/10589/152277