Recent advances in DNA sequencing technologies (next-generation sequencing) have reduced the time needed to sequence a human genome from weeks to hours and the cost from millions of dollars to a thousand dollars. This drop in cost has led to the production of large amounts of genomic data, which in turn has enabled the establishment of large-scale sequencing projects and the application of big data analysis techniques in the genomics domain. In 2013, the GenoMetric Query Language (GMQL) was developed to operate on heterogeneous genomic datasets. This thesis introduces a machine learning and data analysis module for GMQL, tailored to the analysis of next-generation sequencing data. The thesis also addresses two biological problems using this module. The first problem is to predict the cancer type in a multi-class classification setting using RNA-seq data acquired from the Cancer Genome Atlas (TCGA) database. The 14 cancer types were selected according to the leading estimated death rates by cancer type in the 2017 statistics provided by the American Cancer Society. Various classification techniques were applied to the problem, and linear models such as SVM with a linear kernel and logistic regression with an L2 regularization term performed best at predicting the cancer type. Logistic regression with L2 regularization, in particular, yielded a 10-fold cross-validated accuracy of 93%. The second biological problem addressed in this thesis is the association of mutations occurring in enhancers with specific human traits and diseases. The mutations were retrieved from a genome-wide association studies dataset and the enhancers were acquired from the ENCODE dataset. Using GMQL, we identified the most frequent mutations associated with the diseases. Additionally, the spectral biclustering algorithm revealed a subset of mutations showing similar behavior across a subset of traits. The results are reported in an appendix for further biological interpretation.
The recent development of DNA sequencing technologies (next-generation sequencing) has reduced the time needed to sequence a human genome from several weeks to a few hours, as well as the cost, which has fallen from millions of dollars to about a thousand, thereby allowing enormous quantities of genomic data to be produced. This has enabled the creation of large-scale sequencing data projects and the application of big data analysis techniques in the genomic field. In 2013, the query language GMQL (GenoMetric Query Language) was developed to operate on heterogeneous genomic datasets. This thesis introduces a GMQL module for machine learning and data analysis, intended for analyzing the data generated by next-generation sequencing techniques. The thesis also addresses two biological problems using this module. The first is to predict the cancer type within a multi-class tumor classification using RNA-seq data acquired from the Cancer Genome Atlas (TCGA) database. The 14 cancer types were selected on the basis of the leading mortality rates estimated for 2017, a statistic provided by the American Cancer Society. Several classification techniques were applied, and the best at predicting the cancer type were the linear models: SVM with a linear kernel and logistic regression with an L2 regularization term. The latter, in particular, achieved a prediction accuracy of 0.9352. The second biological problem treated in this thesis is the correlation between mutations in enhancers and specific human traits/diseases. The mutations were obtained from a genome-wide association studies dataset, while the enhancer data were extracted from the ENCODE dataset. Using GMQL, the most frequent mutations associated with the diseases were identified. In addition, the spectral biclustering algorithm revealed a subset of mutations that behave similarly across the subset of traits. The results are reported in the appendix for further biological interpretation.
A statistical framework for the analysis of genomic data
TUNCEL, MUSTAFA ANIL
2016/2017
File | Size | Format
---|---|---
2017_10_Tuncel.pdf (openly accessible online; description: Thesis M.A. Tuncel) | 6.81 MB | Adobe PDF
Documents in POLITesi are protected by copyright and all rights are reserved, unless otherwise indicated.
https://hdl.handle.net/10589/136444