Automation of retrieval, transformation and uploading of genomic data and their metadata for their integration into a GDM repository
VERA PENA, JORGE IGNACIO
2016/2017
Abstract
Thanks to next-generation sequencing (NGS) techniques, whole genome sequences are produced more cheaply and quickly every year, so genomic data is being gathered at an unprecedented pace. Processing NGS data reveals new, meaningful relationships between genomic regions and helps answer fundamental biological questions; managing NGS data therefore now appears to be the most important big data problem facing humankind. Because newly generated NGS data are mostly heterogeneous, they are not easily interoperable. The Genomic Data Model (GDM) describes NGS data in a homogeneous way so that they can interoperate. GMQL is a next-generation query language that operates on GDM data and gives biologists genomics-specific operations for processing large volumes of data to discover biological knowledge. This thesis studies how to improve NGS data analysis by automating and standardizing the integration of genomic data and their experimental metadata into a GDM repository. The software developed, GMQLImporter, extracts NGS data from multiple data providers, transforms the data according to GDM specifications, and loads the standardized GDM datasets into GMQL for further querying. GMQLImporter was tested by downloading, transforming, and loading into a GDM repository 133,648 samples gathered from 2 different data providers and organized into 16 datasets. It can easily be extended to include samples from new data sources, making more NGS data available for querying and for new discoveries in bioinformatics.

File | Description | Size | Format |
---|---|---|---|
Jorge_Ignacio_Vera_Pena_Thesis_withcorrections.pdf (openly accessible on the internet) | Complete thesis with appendix and corrections | 2.47 MB | Adobe PDF |
Documents in POLITesi are protected by copyright and all rights are reserved, unless otherwise indicated.
https://hdl.handle.net/10589/135029