Computational inference of DNA folding principles: from data management to machine learning

DNA is the molecular basis of life. In humans, it consists of approximately three billion base pairs, which would measure about three meters if stretched out linearly. To fit inside the cell nucleus, at the micrometer scale, DNA must therefore fold onto itself many times. Only recently has it become possible to show that genomes fold into several layers of hierarchical structures. Of major biological importance are Topologically Associating Domains (TADs): genomic regions with high self-interaction and few interactions across their boundaries, which are thought to be associated with the functional compartmentalization of genomic features such as genes and their regulatory elements. Understanding the mechanisms of TAD formation and genome folding is therefore a major biological research problem. Studying chromatin conformation requires substantial computational resources and complex data analysis pipelines, so scalable and interactive software is needed to explore such large genomic datasets. In this thesis, I first present my work on the design and implementation of PyGMQL, a software package for the interactive and scalable exploration of genomic data. PyGMQL, which extends the previously developed GMQL system, allows the user to programmatically inspect arbitrarily large genomic datasets and to design complex analysis pipelines. Through the adoption of the Genomic Data Model, it also allows the integration of heterogeneous data coming from different experimental procedures. PyGMQL is an easy-to-use Python library that interacts seamlessly with other data analysis packages. I also present my contributions to two extensions of the GMQL system itself: first, we created a federated database implementation of the GMQL engine; second, we designed the ScQL language and data model, which generalize the concepts of the GMQL language to arbitrary scientific data.
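To make the idea of integrating heterogeneous genomic data more concrete, the following is a minimal sketch of a region-plus-metadata data model in plain Python. It does not use or reproduce the PyGMQL API: the `Region` and `Sample` classes, the `select_by_metadata` and `count_overlaps` helpers, and the toy coordinates are all hypothetical names and values introduced here for illustration, assuming half-open genomic intervals.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Region:
    """One genomic interval with half-open coordinates [start, stop)."""
    chrom: str
    start: int
    stop: int
    strand: str = "*"


@dataclass
class Sample:
    """A sample pairs its regions with free-form metadata (hypothetical layout)."""
    regions: list
    metadata: dict = field(default_factory=dict)


def select_by_metadata(dataset, **conditions):
    """Keep the samples whose metadata match every key=value condition."""
    return [s for s in dataset
            if all(s.metadata.get(k) == v for k, v in conditions.items())]


def overlaps(a, b):
    """True when two regions share at least one base on the same chromosome."""
    return a.chrom == b.chrom and a.start < b.stop and b.start < a.stop


def count_overlaps(reference, experiment):
    """For each reference region, count the overlapping experiment regions."""
    return {ref: sum(overlaps(ref, exp) for exp in experiment.regions)
            for ref in reference.regions}


# Toy data (made up): one sample of domains and one sample of binding peaks.
domains = Sample([Region("chr1", 0, 1000), Region("chr1", 1000, 2500)],
                 {"data_type": "TAD"})
peaks = Sample([Region("chr1", 100, 150), Region("chr1", 990, 1010),
                Region("chr2", 100, 150)],
               {"data_type": "ChipSeq", "target": "CTCF"})
dataset = [domains, peaks]

# Select a sample by its metadata, then relate its regions to the domains.
ctcf = select_by_metadata(dataset, target="CTCF")[0]
counts = count_overlaps(domains, ctcf)
```

Keeping regions and metadata side by side is what lets samples from different experimental procedures be selected by their metadata and then combined by their coordinates; a scalable system would replace the quadratic overlap loop used here with an indexed or distributed join.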
In the second part of the thesis, I apply my software to the study of chromatin conformation data. I study the epigenetic determinants of TADs, with a focus on the positions and motif orientations of the binding sites of the CTCF insulator protein. I discover a set of spatial rules of CTCF orientation that correlate with TADs and their topological characteristics. The results of this study highlight the existence of a "grammar of genome folding" that dictates the formation of TADs and their boundaries. I also present an ongoing extension of this work, in which I study how histone modifications and transcription factors affect chromatin conformation. Finally, I focus on the relationship between chromatin conformation and gene expression. I frame this question as a machine learning problem and design a graph representation learning model that encodes the chromatin topological features of genes. The learned gene embeddings are then used as inputs to a Random Forest classifier, which is trained to predict whether two genes are co-expressed, given independent gene expression data. The results indicate a correlation between chromatin topology and co-expression, shedding new light on this debated topic and providing a novel computational framework for the study of co-expression networks. In this thesis, I therefore provide both a novel software stack for the analysis and exploration of genomic data and a set of novel biological results, which help to clarify the mechanisms of TAD formation and the relationship between chromatin conformation, epigenetics, and gene expression.
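As an illustration of what an analysis of CTCF motif orientation can look like, the sketch below labels each pair of neighbouring binding sites by the relative orientation of their motifs. It is not the set of rules derived in the thesis: the function names, the tuple layout, and the toy sites are assumptions made here, and the labels follow the common convention in which a forward motif followed by a reverse motif is called convergent.

```python
from collections import Counter


def orientation_pattern(upstream_strand, downstream_strand):
    """Name the relative orientation of two neighbouring CTCF motifs."""
    if upstream_strand == downstream_strand:
        return "tandem"
    # '+' followed by '-' means the two motifs point towards each other.
    return "convergent" if upstream_strand == "+" else "divergent"


def adjacent_patterns(sites):
    """Classify every pair of consecutive sites on one chromosome.

    `sites` is a list of (position, strand) tuples; strand is '+' or '-'.
    """
    ordered = sorted(sites)
    return [(a[0], b[0], orientation_pattern(a[1], b[1]))
            for a, b in zip(ordered, ordered[1:])]


# Toy motif calls (positions and strands are made up for illustration).
sites = [(100, "+"), (400, "-"), (450, "+"), (900, "+"), (1200, "-")]
pairs = adjacent_patterns(sites)
summary = Counter(pattern for _, _, pattern in pairs)
```

Counting such patterns inside domains and around their boundaries, and comparing the counts with a suitable background, is one simple way to test whether motif orientation is associated with domain structure.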

DNA is the basis of life and is composed of approximately three billion nitrogenous base pairs, which, if unrolled linearly, would reach a length of about three meters. Consequently, to reside in the cell nucleus at the micrometer scale, DNA must fold onto itself many times. Only recently has it become possible to reveal that the genome folds into several hierarchical levels. Of great importance in this context are Topological Domains (TADs), which are regions of the genome that interact strongly within themselves and only weakly with the rest of the genome. Topological Domains are hypothesized to be a mechanism for the functional compartmentalization of biological processes such as gene expression and its regulatory elements. For these reasons, it is of fundamental importance to understand the mechanisms underlying the formation of TADs and, more generally, the folding of the genome. Studying chromatin conformation requires substantial computational resources and complex data analyses. For this reason, it is also necessary to develop software for the interactive and scalable exploration of these large biological databases. In this thesis, I first present my work on the design and implementation of PyGMQL, a software package for the scalable and interactive exploration of genomic data. PyGMQL, which extends the GMQL system, allows the user to programmatically inspect large volumes of genomic data and to design complex data analysis pipelines. The software also allows the integration of heterogeneous data coming from different experimental procedures, thanks to the adoption of the Genomic Data Model. PyGMQL is an easy-to-use Python library and interacts easily with other data analysis software packages.
I also present my contribution to the extension of the GMQL system itself: first, I contributed to the implementation of a federated system based on the GMQL engine; second, I contributed to the design of the ScQL language and its data model, generalizing the concepts of the GMQL language to arbitrary scientific data. In the second part of the thesis, I apply my software to the study of chromatin conformation data. I study the epigenetic determinants of TADs, with attention to the position and orientation of the motifs in the binding sites of the CTCF insulator protein. This led us to discover a set of rules, based on CTCF orientation, that correlate with TADs and their topological characteristics. The results of this study underline the existence of a "grammar of DNA folding" that determines the formation of TADs and their boundaries. I also present an extension of this work that is currently in progress, in which I study how histone modifications and other transcription factors correlate with and determine chromatin conformation. Finally, I focus on the relationship between chromatin conformation and gene expression. I model this research question as a classic machine learning problem, defining a representation learning model for encoding the topological characteristics of genes. The learned embeddings are then used as input to a classification model based on Random Forest, trained to predict whether two genes are co-expressed, on the basis of independent gene expression data. The results indicate a correlation between chromatin topology and gene co-expression, shedding new light on this debated problem and providing a new computational model for studying co-expression networks.
In this thesis, therefore, I present a new software stack for the analysis and exploration of genomic data, together with a set of new biological results that help to clarify some of the mechanisms of TAD formation and the relationship between chromatin conformation, epigenetics, and gene expression.