Querying the DNA: genomic computing with Apache Flink

A new technology for reading DNA, called Next Generation Sequencing (NGS), is changing biological research and medical practice, thanks to the low-cost availability of millions of whole-genome sequences from a variety of species and, most importantly, from humans. So far, the bioinformatics research community has mostly been challenged by primary and secondary analysis (data alignment and feature calling), but the emerging problem today is so-called tertiary analysis, which is concerned with multi-sample processing, annotation and filtering of variants, and genome browser-driven exploratory analysis. The amount of data involved in tertiary analysis requires Big Data management. GenData 2020 is a project born to address this problem. The project has produced two main results: the first is a data model called GDM, which includes both region data and the metadata associated with DNA experiments; the second is GMQL, a query language defined to query data in GDM. GDM is a general format that can bring together heterogeneous processed data, so that with GMQL biologists can query multiple different sources together. The first version of GMQL was born in 2013; the group is now developing version 2. This new version aims to improve the language and to move the platform to new big data frameworks that allow better implementation and optimization. In this thesis I present the design and implementation of the project with Apache Flink, an open source framework that offers APIs for big data management in cloud environments. To make the entire project understandable, the first chapters of the thesis introduce the current state of the project, the model, and the language defined by the group. Then comes the core of the thesis, which consists of the definition and implementation of the algorithms using Apache Flink, a platform that is totally different from Apache Pig, which was used in version 1. 
In total, 29 different algorithms have been developed; 5 of them, related to the most important operations, are explained in the thesis. Finally, the thesis presents the testing phase of the project and a comparison of the Flink version with the Spark version, which is the subject of a paper that the group will present at an IEEE conference.

A new technology for reading DNA, called Next Generation Sequencing (NGS), is changing biological research and medical practice, thanks to the low-cost availability of millions of DNA sequences from a wide variety of species, including humans. So far, the bioinformatics research community has focused mostly on primary and secondary analysis (alignment and correlation), but the recent problem is tertiary analysis, which concerns the processing of many experimental samples and exploration through visual browsers. The amount of data processed in tertiary analysis requires a Big Data system. GenData 2020 is a research project born to address this problem. The project has had two main developments so far: the definition of a data model called GDM, which includes both the data about DNA regions and the associated metadata, and which is a general model that allows heterogeneous data to be combined in the same source so that several data sources can be queried at the same time; and a query language for GDM, called GMQL. The first version of GMQL was born in 2013; the group is now developing version 2. The aim is to improve the language and to optimize its execution. This thesis presents algorithms developed for the project using Apache Flink, an open source framework for big data processing in a cloud environment. To understand the development and design of this thesis, it is necessary to introduce the current state of the group's research, the data model, and the language defined previously. The thesis then moves on to its core, which consists of the design and implementation of the algorithms needed to execute the GMQL language. In total, 29 algorithms were written; 5 of them, the most important operations of the language, are presented in this thesis. 
Finally, the testing phase is presented, together with a comparison between the Flink version and the Spark version, which is the subject of a paper that the group will present at an IEEE conference.