Recent developments in biotechnology make it possible to sequence the genomes of many organisms at unprecedented speed and at very limited cost. Genome sequencing experiments reveal many interesting properties of cells, such as the expression of genes, the presence of mutations, the 3D conformation of DNA and its interaction with proteins. Consequently, several international consortia have been established with the mission of collecting sequencing experiments and publishing the results for further investigation. The study of these datasets is fundamental for answering some of the most important biological questions. Tools that enable an unbiased, data-driven exploration of sequencing data are particularly valuable. In my thesis I focused on one class of such tools, namely methods for clustering genes according to the correlation of their expression profiles across many patients and biological conditions. As an outcome, I propose the Multi-X algorithm, a novel method for gene clustering that fulfills the requirements of a good gene clustering method. The algorithm is an ensemble of X-Means clusterings that automatically detects the optimal number of clusters and avoids forced cluster assignments by generating a noise set. Multi-X was validated on synthetic datasets and by comparison with results from the literature. I then used the method to address two open biological questions: (a) how strong the relationship is between the 3D hierarchical organization of the genome and the expression of genes, and (b) how gene clusters in tumor cells differ from those in healthy cells. The results of these studies highlighted interesting properties of the human genome, thus demonstrating the validity of the proposed clustering approach.
Recent studies in biotechnology have made it possible to sequence the genomes of various organisms quickly and at a markedly reduced cost. These genome sequencing experiments have led to the discovery of interesting properties of the genome, such as gene expression, the presence of mutations, the three-dimensional structure of DNA and its interactions with proteins. Thanks to these improvements, many international consortia are working to collect DNA sequencing data, with the aim of facilitating the study of the genome and answering the most important questions in biology. Exploring sequencing data through a data-driven approach is of particular interest. In my thesis work I followed this approach: I applied a gene clustering method based on the correlation between gene expression profiles across different patients and biological conditions, guided by the data rather than by additional biological knowledge. I therefore propose an algorithm, called Multi-X, that performs gene clustering while satisfying the requirements suggested by the literature. The algorithm is an ensemble of X-Means clusterings that automatically determines the number of clusters and, instead of forcing assignment to a cluster, builds a noise set. The effectiveness of the algorithm was tested on synthetic data and through a comparison with results in the literature. The method was then used to address two open biological questions: (a) whether there is a relationship between the three-dimensional hierarchical structure of the genome and gene expression, and (b) how the expressions and correlations of healthy cells differ from those of tumor cells. The results of this analysis bring to light interesting characteristics of the human genome, demonstrating the validity of the proposed clustering method.
An unbiased approach for clustering big genomic data
RACITI, ILARIA
2016/2017
File | Description | Size | Format |
---|---|---|---|
2017_10_Raciti.pdf (openly accessible on the internet) | Thesis text | 18.13 MB | Adobe PDF |
Documents in POLITesi are protected by copyright and all rights are reserved, unless otherwise indicated.
https://hdl.handle.net/10589/135889