Topologically Associating Domains (TADs) are genomic regions within which enhancers and gene promoters interact mainly with partners inside the same TAD, making TADs fundamental to gene expression. TAD boundaries are enriched with binding sites for a protein called CTCF, and these sites are asymmetric: a site is called forward if its motif reads from left to right (>), and reverse if it reads from right to left (<). This work builds on prior research by Nanni et al. and aims to determine whether the phenomenon observed in the human can also be observed in the mouse. That phenomenon is a characteristic distribution of CTCF-binding sites around TAD boundaries: divergent (<>) sites are enriched and convergent (><) sites are depleted at TAD boundaries, while forward and reverse sites peak on the right and on the left side of boundaries, respectively. This work identified the positions of CTCF-binding sites in the human and mouse genomes and studied their characteristics, revealing that the orientations of these sites follow a specific grammar in which forward and reverse sites appear roughly the same number of times, as do convergent (><) and divergent (<>) sites. For the mouse, a set of consensus boundaries was obtained and the orientations of the CTCF-binding sites around these boundaries were studied, confirming the observations made in the human: divergent sites are enriched at TAD boundaries, convergent sites are depleted, and forward and reverse sites tend to lie to the right and to the left of boundaries, respectively. Two Machine Learning models, a Random Forest and a Convolutional Neural Network, were then used to try to predict the positions of boundaries from the positions and orientations of CTCF-binding sites. These models were not able to predict the boundary positions accurately.
This work therefore proposes another approach, which gives satisfactory results: TAD boundary positions are predicted by a Simulated Annealing algorithm followed by a greedy algorithm, starting from randomly generated positions and randomly substituting them until the distributions around the predicted boundaries resemble those observed around known boundaries. This leads to a new perspective: the phenomenon observed in both the human and the mouse is not strongly tied to individual TAD boundaries, but involves the organization of the entire genome. The positions of TADs therefore cannot be predicted from the local characteristics of portions of the genome; the genome must be considered as a whole.
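The two-stage search described above can be illustrated with a minimal sketch. Everything concrete in it is an assumption made for illustration and is not taken from the thesis: the four-number orientation profile, the L1 distance, the window size, the cooling schedule, the synthetic sites, the made-up target profile, and the function names `profile`, `cost` and `anneal_then_greedy`.

```python
import bisect
import math
import random


def profile(boundaries, sites, window):
    """Fractions of (forward-left, forward-right, reverse-left, reverse-right)
    sites lying within `window` bp of the boundaries. `sites` is a list of
    (position, orientation) tuples sorted by position, orientation in "><"."""
    positions = [pos for pos, _ in sites]
    counts = [0, 0, 0, 0]
    for b in boundaries:
        lo = bisect.bisect_left(positions, b - window)
        hi = bisect.bisect_right(positions, b + window)
        for pos, orientation in sites[lo:hi]:
            side = 0 if pos < b else 1
            counts[(0 if orientation == ">" else 2) + side] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * 4


def cost(boundaries, sites, window, target):
    """L1 distance between the candidate profile and the target profile."""
    observed = profile(boundaries, sites, window)
    return sum(abs(o - t) for o, t in zip(observed, target))


def anneal_then_greedy(sites, length, n, target, window=50_000,
                       sa_steps=1500, greedy_steps=500,
                       t0=0.5, cooling=0.995, seed=0):
    rng = random.Random(seed)
    current = [rng.randrange(length) for _ in range(n)]
    cur_cost = cost(current, sites, window, target)
    initial_cost = cur_cost
    best, best_cost = list(current), cur_cost
    temp = t0
    # Stage 1: simulated annealing. Worse moves are sometimes accepted,
    # with a probability that shrinks as the temperature drops.
    for _ in range(sa_steps):
        cand = list(current)
        cand[rng.randrange(n)] = rng.randrange(length)
        c = cost(cand, sites, window, target)
        if c <= cur_cost or rng.random() < math.exp((cur_cost - c) / temp):
            current, cur_cost = cand, c
            if c < best_cost:
                best, best_cost = list(cand), c
        temp *= cooling
    # Stage 2: greedy refinement. Only strict improvements are kept.
    for _ in range(greedy_steps):
        cand = list(best)
        cand[rng.randrange(n)] = rng.randrange(length)
        c = cost(cand, sites, window, target)
        if c < best_cost:
            best, best_cost = cand, c
    return sorted(best), initial_cost, best_cost


# Synthetic toy data (not real CTCF sites) and a made-up target profile in
# which forward sites dominate on the right and reverse sites on the left.
LENGTH = 10_000_000
data_rng = random.Random(1)
toy_sites = sorted((data_rng.randrange(LENGTH), data_rng.choice("><"))
                   for _ in range(2000))
toy_target = [0.1, 0.4, 0.4, 0.1]
predicted, start_cost, end_cost = anneal_then_greedy(
    toy_sites, LENGTH, 20, toy_target)
print(f"cost before: {start_cost:.3f}, after: {end_cost:.3f}")
```

Note that the cost in this sketch is computed over the whole set of candidate positions at once, not boundary by boundary, which is one way to read the abstract's point that the signal is a genome-wide property.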
Topologically Associating Domains (TADs) are regions of the genome characterized by the fact that enhancers and gene promoters interact with each other only if they lie within the same TAD, which makes TADs a fundamental element of gene expression. TAD boundaries contain many binding sites for a particular protein called CTCF. These sites are asymmetric, so a direction is associated with them: they are called forward if they appear from left to right (>) and reverse otherwise (<). This thesis builds on prior research by Nanni et al. and aims to verify whether the phenomenon observed in that study for the human also occurs in the mouse. The phenomenon consists of a particular distribution that CTCF-binding sites assume around TAD boundaries: there are many divergent (<>) sites and few convergent (><) sites around TAD boundaries, and forward and reverse sites concentrate to the right and to the left of those boundaries, respectively. In this research the positions of the binding sites in the human and mouse genomes were located and their characteristics were studied, revealing that these sites follow a grammar: forward and reverse sites must occur in roughly equal numbers, and pairs of convergent (><) and divergent (<>) sites must alternate along the genome. In the mouse, the directions of the sites in the neighborhood of TAD boundaries were studied, revealing that what was observed for the human also holds for the mouse: near the boundaries there are many divergent sites and few convergent sites; moreover, forward sites tend to be positioned to the right of the boundaries, while reverse sites are found mostly to the left.
Subsequently, two Machine Learning models, a Random Forest and a Convolutional Neural Network, were used to try to predict the positions of TAD boundaries from the positions and directions of the CTCF-binding sites. These models did not produce acceptable results. This work therefore proposes another algorithm that gives promising results and consists of a sequence of two sub-algorithms: a Simulated Annealing algorithm followed by a greedy algorithm, which start from a set of random positions on a chromosome and change them, trying to minimize the distance between the distributions of site directions observed around the random positions and those computed around known TAD boundaries. This leads to a new perspective: the phenomenon observed in both the human and the mouse does not concern individual TAD boundaries, but the organization of the entire genome. The positions of TADs therefore cannot be predicted from the local characteristics of portions of the genome; the genome must be considered in its entirety.
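The orientation grammar mentioned in the abstract can be made concrete with a small counting sketch. It assumes, for illustration only, that a chromosome's sites are encoded as a string of `>` and `<` in genomic order and that a convergent or divergent pair means two adjacent sites; the thesis may define pairs differently, and the function name `orientation_summary` is hypothetical.

```python
from collections import Counter


def orientation_summary(orientations):
    """Count single orientations and adjacent-pair patterns in a string of
    '>' (forward) and '<' (reverse) sites listed in genomic order."""
    singles = Counter(orientations)
    pairs = Counter(a + b for a, b in zip(orientations, orientations[1:]))
    return {
        "forward": singles[">"],
        "reverse": singles["<"],
        "convergent": pairs["><"],  # forward site followed by reverse site
        "divergent": pairs["<>"],   # reverse site followed by forward site
    }


# Hand-written toy sequence, not real data.
example = ">><<>><><<"
summary = orientation_summary(example)
print(summary)
```

Under the adjacent-pair assumption, the two kinds of orientation switch necessarily alternate along any linear sequence, so the convergent and divergent counts can differ by at most one; the balance between forward and reverse sites, by contrast, is not forced by the encoding.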
Exploring genome organization: data science analysis of CTCF motif distribution and prediction of Topological Associating Domain boundaries
BUONAGURIO, CHIARA
2022/2023

File | Size | Format | |
---|---|---|---|
2024_04_Buonagurio_Tesi_01.pdf (not accessible; description: Thesis) | 3.23 MB | Adobe PDF | |
2024_04_Buonagurio_ExecutiveSummary_02.pdf (accessible online only to authorized users; description: Executive Summary) | 761.55 kB | Adobe PDF | |
Documents in POLITesi are protected by copyright and all rights are reserved, unless otherwise indicated.
https://hdl.handle.net/10589/218511