Genetic diseases are pathological conditions in which a modification of the nucleotide sequence of DNA is the predominant cause or a necessary contributing cause. We can distinguish between monogenic (Mendelian) and polygenic-multifactorial (complex) genetic diseases, depending on whether the mutation affects a single gene or several genes at the same time. The study of the genetic bases of diseases has attracted great interest in recent times, as the development of new DNA/RNA sequencing technologies has led to the determination of the exact nucleotide sequence of the genome and to the description of single nucleotide polymorphisms. Each genomic position can show two or more alternative forms, called alleles, that control the same trait and can encode qualitatively and/or quantitatively different products. Single nucleotide polymorphisms (SNPs) are positions that vary within a population. In diploid species such as humans, a single position is described by the combination of two alleles (the genotype). Homozygous genotypes contain two copies of the same nucleotide, while heterozygous genotypes contain two different nucleotides. By comparing each position of the human genome with the corresponding position in the chimpanzee genome (the species most closely related to humans), we can define as ancestral alleles the nucleotides found at the orthologous positions of the chimpanzee genome and as derived alleles all the others. Individual genetic makeup is one of the strongest risk factors for a large number of complex diseases, such as cancer, diabetes, and cardiovascular, autoimmune and psychiatric diseases. The influence of multiple allelic variants, together with the interaction between the individual and the environment, makes their description very complicated. However, understanding the genetic bases of diseases is of fundamental clinical importance for planning effective prevention and treatment strategies.
To analyse the allelic architecture of a genetic disease, it is possible to conduct an association study, namely a genome-wide scan for correlation signals between a phenotype and allelic variants, or to examine the genetic differences between two large groups of healthy and affected individuals. Such studies commonly identify a large number of allelic variants that are only weakly associated with a phenotypic trait, which makes it hard to understand which of them are truly determinant for the disease. In these circumstances, a broader and deeper analysis is needed, one that can integrate and combine the results of studies of different kinds. For this purpose, classical genetic approaches have recently been complemented by population genetic studies based on evolutionary principles. Changes in allele frequencies over time are driven by the pressure imposed by both natural selection and population history, including changes in population size and migratory events. Natural selection refers to any non-random, differential propagation of an allelic variant in relation to its phenotypic effect. The selective action is directional: beneficial alleles are propagated (positive selection) and deleterious ones are eliminated (negative selection). From this perspective, allelic variants that are causal for genetic diseases appear to be a substrate for natural selection, and various theoretical models have been proposed to describe their genetic architecture in an evolutionary context. Inferences of natural selection can add important information, since allelic variants and genes that have been under selective pressure have a higher probability of being functional and therefore of being linked to disease susceptibility.
Population genetic studies can guide association studies towards genomic regions that have been targets of selection and, conversely, variants weakly associated with the disease can be tested for selection, with the aim of providing a broad and comprehensive analysis of the genetic basis of complex diseases. In this thesis, we study long QT syndrome (LQTS), a complex cardiovascular disease that manifests as a lengthening of the QT interval of the electrocardiogram. The QT interval is measured on the ECG trace from the beginning of the QRS complex to the end of the T wave and corresponds to the depolarization and repolarization of the myocardial tissue. When the QT interval is longer than normal, the probability of tachycardia and ventricular fibrillation increases. These cardiac anomalies are a strong risk factor for sudden cardiac death (SCD), an event that is among the major causes of mortality in developed countries and that mostly occurs in individuals who were not aware of being predisposed to it. Identifying the genetic variants that determine the length of the QT interval could have a deep clinical impact by allowing the development of genetic screening tests. Recent association studies of the long QT phenotype have identified a wide spectrum of allelic variants in different genes (NOS1AP, KCNQ1, KCNH2, SCN5A and KCNJ2, among others), but it is not yet clear which of them are causal and determinant for this trait. Analysing the problem from an evolutionary perspective gave rise to the hypothesis that some ancestral allelic variants of the NOS1AP gene, associated with a shorter QT interval, show signals of positive selection in individuals of European origin. New and more detailed evolutionary inferences can focus the investigation of the genetic bases of LQTS on the mutations that have been targets of natural selection and therefore have a higher probability of being functional.
Positive selection of an evolutionarily beneficial allele increases its frequency within the population until it reaches high prevalence and, in some cases, fixation (a frequency of 100%). The effect also extends to the surrounding genomic region: all the variants that are linked to the selected one (linkage disequilibrium) rise in frequency. The signature left by positive selection on a genomic region is called a selective sweep and consists of a reduction of the haplotype diversity of the population around the causal variant. A haplotype is defined as a sequence of alleles that are in linkage disequilibrium with one another. The search for this pattern in the genome has historically been based on computing summary statistics on genetic data and on comparing the observed values with the expectations under the null hypothesis of neutrality. Each statistic focuses on a single aspect of the signal, and its detection power decreases drastically in the presence of confounding effects caused by the demographic history of the population and by the characteristics of the genetic region examined. The most advanced methods for detecting selective patterns in the genome apply machine learning algorithms to population genetic summary statistics, and they primarily classify a genetic region as neutral or as subject to one of several modes of positive selection (soft sweep and hard sweep). In this thesis, we implemented a convolutional neural network (CNN41) that estimates the intensity of the positive selective pressure that acted on a region of the human genome. With respect to the state of the art in population genetics, it is an innovative research method from both a conceptual and a methodological point of view.
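As an illustration of the reduction of haplotype diversity that characterises a selective sweep, the following minimal sketch computes a standard haplotype diversity index, H = n/(n-1) * (1 - sum of squared haplotype frequencies). The function name and the toy samples are assumptions made for this example; the thesis does not state that this particular index was used.

```python
from collections import Counter

def haplotype_diversity(haplotypes):
    """Haplotype diversity of a sample: n * (1 - sum(p_i^2)) / (n - 1).

    `haplotypes` is a list of hashable haplotypes (for example strings).
    Returns 0.0 when every sampled haplotype is identical.
    """
    n = len(haplotypes)
    if n < 2:
        raise ValueError("at least two haplotypes are required")
    counts = Counter(haplotypes)
    sum_sq = sum((c / n) ** 2 for c in counts.values())
    return n * (1.0 - sum_sq) / (n - 1)

# Toy samples (hypothetical data): a diverse region and a swept region.
diverse_region = ["0101", "1100", "0011", "1010"]
swept_region = ["1101", "1101", "1101", "1101"]

print(haplotype_diversity(diverse_region))  # every haplotype is different
print(haplotype_diversity(swept_region))    # one haplotype dominates
```

A lower value around a candidate variant than in the rest of the genome is the kind of single-aspect signal that a summary statistic captures.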
The basic idea of the project is to abandon summary statistics in favour of a direct translation of genetic data into images, so that all the information they contain can be exploited to make predictions. Convolutional neural networks (CNNs) are today the most advanced machine learning tool for object detection, image classification and image segmentation. They belong to the category of deep neural networks and can be visualized as three-dimensional volumes of artificial neurons whose connectivity pattern is inspired by the animal visual cortex. The neural connections are tuned during the training phase so that the network learns to recognise the informative patterns in the input data that are linked to the desired output. In a classification task, the input image is first represented numerically as a matrix in which each element describes the colour of one pixel. As the image passes through the network layers, it undergoes sequential transformations that end with the prediction of the output class. A convolutional neural network is typically made of one or more convolutional layers in which a series of small weight matrices, called filters, analyse narrow portions of the input image, just as each biological neuron of the visual cortex focuses on its receptive field. The convolution of each filter with the image can be interpreted as the search for a certain feature, and it results in a neural activation map that represents the filtered input image. Each convolutional layer is followed by a pooling layer that reduces the dimensionality of the activation maps while keeping the potentially important discriminatory information extracted by the previous layers. Finally, the matrices are flattened and fed to one or more fully connected layers that have the same architecture as a classical artificial neural network (ANN). This last part of the CNN produces a high-level synthesis of the information and outputs the final prediction.
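The sequence of operations described above (filtering, activation map, pooling, flattening) can be sketched in a few lines of plain Python. This is only an illustration of the general technique: the filter values, the ReLU activation, the 2x2 pooling window and the toy image are assumptions for the example and are not taken from the architecture of CNN41.

```python
def conv2d(image, kernel):
    """Slide `kernel` over `image` (stride 1, no padding) and return the activation map.

    As in most CNN libraries, this is a cross-correlation: the filter is not flipped.
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]

def relu(fmap):
    """Keep positive activations and set negative ones to zero."""
    return [[max(0.0, v) for v in row] for row in fmap]

def max_pool(fmap, size=2):
    """Non-overlapping max pooling; incomplete windows at the borders are dropped."""
    return [
        [
            max(fmap[r + i][c + j] for i in range(size) for j in range(size))
            for c in range(0, len(fmap[0]) - size + 1, size)
        ]
        for r in range(0, len(fmap) - size + 1, size)
    ]

def flatten(fmap):
    """Turn a 2D map into the 1D vector passed to the fully connected layers."""
    return [v for row in fmap for v in row]

# Toy 5x5 binary image with a vertical boundary between columns 1 and 2.
image = [[0, 0, 1, 1, 1] for _ in range(5)]
# Hypothetical filter that responds to a 0 -> 1 transition from left to right.
edge_filter = [[-1, 1], [-1, 1]]

activation_map = relu(conv2d(image, edge_filter))  # 4x4 map, peaks on the boundary
features = flatten(max_pool(activation_map))       # 2x2 map flattened to 4 values
print(features)
```

The pooled map is a quarter of the size of the activation map but still records where the filter responded, which is the trade-off the pooling layer is meant to achieve.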
The number of neurons in the last fully connected layer equals the number of classes defined in the specific classification problem. Our input to the convolutional neural network consists of aligned genetic sequences translated into images in which black and white pixels encode derived and ancestral alleles, respectively. The genetic sequences of the individuals of the examined population are arranged on the rows of the image, while the columns correspond to the genetic positions of the alignment. To highlight the pattern of a selective sweep, the images are processed by ordering the haplotypes (rows) by similarity and occurrence. The convolutional neural network implemented in this thesis carries out a classification task with 41 output classes that correspond to the discretization of a certain range of the selection coefficient, a continuous parameter that is difficult to estimate and that is an index of the intensity of the selective pressure. The 41 output probabilities can be interpreted as the posterior probability distribution of the selection coefficient for the analysed genetic region. The originality of this thesis lies in the translation of genetic data into images and in the estimation of a continuous parameter by solving a classification problem with a high number of output classes. Convolutional neural networks are supervised learning algorithms, so they need to be trained on a dataset of images for which the class of each image is given (labelled training set) before they are able to make predictions on new images of interest (unlabelled test set). The network is trained by adjusting the internal parameters (called weights and biases) of each neuron so that the model as a whole maximizes the prediction accuracy.
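The two ideas in this paragraph, turning an alignment into a black-and-white image and reading the 41 class probabilities as a distribution over the selection coefficient, can be sketched as follows. All function names are hypothetical, the rows are ordered by occurrence only (the similarity criterion is not reproduced), and the range 0 to 400 used for the class centres is an arbitrary placeholder, since the actual range is not stated here.

```python
from collections import Counter

def encode_alignment(haplotypes, ancestral):
    """Turn aligned sequences into rows of 0/1 (0 = ancestral, 1 = derived)."""
    return [
        tuple(0 if base == anc else 1 for base, anc in zip(hap, ancestral))
        for hap in haplotypes
    ]

def sort_rows_by_occurrence(rows):
    """Most frequent haplotypes first; ties are broken by the row content."""
    counts = Counter(rows)
    return sorted(rows, key=lambda row: (-counts[row], row))

def to_pixels(rows):
    """Black (0) for derived alleles, white (255) for ancestral alleles."""
    return [[0 if allele else 255 for allele in row] for row in rows]

def class_centres(s_min, s_max, n_classes=41):
    """Equally spaced values that discretize the range [s_min, s_max]."""
    step = (s_max - s_min) / (n_classes - 1)
    return [s_min + k * step for k in range(n_classes)]

def posterior_mean(probabilities, centres):
    """Point estimate obtained by averaging the class centres over the output probabilities."""
    return sum(p * s for p, s in zip(probabilities, centres))

# Toy alignment: four haplotypes, four sites, ancestral state "ACGT" (hypothetical data).
haplotypes = ["ACGT", "TCGA", "ACGT", "ACGA"]
rows = sort_rows_by_occurrence(encode_alignment(haplotypes, ancestral="ACGT"))
image = to_pixels(rows)

# Hypothetical network output: all probability mass on the fourth of 41 classes.
centres = class_centres(0.0, 400.0)  # placeholder range, not the one used for CNN41
probabilities = [0.0] * 41
probabilities[3] = 1.0
print(image)
print(posterior_mean(probabilities, centres))
```

Keeping the full vector of 41 probabilities, rather than only the most probable class, is what allows the output to be read as a posterior distribution with its own spread.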
The parameters are tuned with the back-propagation algorithm, which feeds a small number of images of the training set through the network and then estimates the error gradient on the output vectors produced for these examples. The error gradient is then propagated backwards through the network: the contribution of a given hidden neuron to the error is computed as a linear combination of its weight vector and the errors associated with all the neurons in the following layers. The parameters of each neuron are then updated with a stochastic gradient descent algorithm. This process is repeated until every training example has been fed through the network, which marks the completion of a single training epoch. Training continues until a specified stopping criterion is reached, typically a predefined number of epochs or an improvement in network performance that falls below a certain threshold. Given the high complexity of classifying a genetic region into 41 classes of the selection coefficient, the training set of CNN41 consists of more than 100,000 images of genetic data obtained with software for extensive genetic simulations (MSMS). The first experimental part focused on exploring the various ways of translating aligned genetic sequences into images, considering single-population or multi-population data and using different colour codes: black and white, and colour "CMYK" with four channels, one for each DNA base (adenine, cytosine, guanine and thymine). The choice of black-and-white (derived/ancestral) single-population images was followed by a preliminary experimental phase in which we implemented convolutional neural networks for the binary classification of a genomic region as neutral or under positive selection. We ran experiments to optimize the image processing, considering the ordering of rows and/or columns by similarity and occurrence.
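The training loop described above (mini-batches, gradient step, epochs, stopping criterion) can be illustrated on a deliberately tiny model. The sketch below trains a single-neuron logistic classifier, so back-propagation reduces to one gradient computation; it is not the CNN41 training code, and the learning rate, batch size, threshold and toy data are all assumptions chosen for the example.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(weights, bias, x):
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)

def mean_loss(weights, bias, data):
    """Average cross-entropy over the whole dataset."""
    eps = 1e-12
    total = 0.0
    for x, y in data:
        p = predict(weights, bias, x)
        total -= y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
    return total / len(data)

def train(data, n_features, lr=0.5, batch_size=4,
          max_epochs=200, min_improvement=1e-4, seed=0):
    rng = random.Random(seed)
    weights, bias = [0.0] * n_features, 0.0
    history = [mean_loss(weights, bias, data)]
    for _ in range(max_epochs):            # stopping criterion 1: number of epochs
        order = list(data)
        rng.shuffle(order)
        for start in range(0, len(order), batch_size):
            batch = order[start:start + batch_size]
            grad_w, grad_b = [0.0] * n_features, 0.0
            for x, y in batch:
                err = predict(weights, bias, x) - y  # gradient of the loss w.r.t. the pre-activation
                for j, xi in enumerate(x):
                    grad_w[j] += err * xi
                grad_b += err
            # Stochastic gradient descent update using the mini-batch average.
            weights = [w - lr * g / len(batch) for w, g in zip(weights, grad_w)]
            bias -= lr * grad_b / len(batch)
        history.append(mean_loss(weights, bias, data))  # one epoch completed
        if history[-2] - history[-1] < min_improvement:  # stopping criterion 2
            break
    return weights, bias, history

# Toy labelled training set (hypothetical): one feature, two classes.
data = [([0.0], 0), ([0.1], 0), ([0.2], 0), ([0.3], 0),
        ([0.7], 1), ([0.8], 1), ([0.9], 1), ([1.0], 1)]
weights, bias, history = train(data, n_features=1)
print(len(history) - 1, "epochs; loss", history[0], "->", history[-1])
```

In a deep network the same update is applied to every layer, with the error term of each hidden neuron obtained from the layers that follow it.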
The predictive abilities of the CNN were compared with those of a classical analysis method that makes inferences from summary statistics. To this end, we implemented a support vector machine (SVM), a supervised machine learning algorithm for binary classification, regression and outlier detection. Each example of the training set is a replicate from the genetic simulations, which the SVM represents as a point in the plane whose coordinates are the corresponding summary statistics, Tajima’s D and nSL. Training ends when the best separating surface is found between the points that belong to the two classes of the problem, neutrality and positive selection. We also compared the predictive abilities of the CNN and the SVM when the demographic model of the population is incorrect or misspecified. The encouraging results obtained in this first experimental phase were followed by the implementation of a convolutional neural network with 41 output classes (CNN41) and its application to the search for selective patterns around the allelic variants associated with long QT syndrome. We tested the promoter region of the NOS1AP gene for selection, given the previously suggested hypothesis of positive selection and the strong signal of association with the lengthening of the QT interval. The selection coefficient estimated for the allelic variant rs10918594 is almost three times higher than the one estimated for rs12143842; although rs12143842 is the variant most strongly associated with the phenotype, given these inferences of positive selection we propose rs10918594 as the most probable causal variant for LQTS. However, we believe that, to reach such important conclusions on real genetic data, we need to fine-tune the training of CNN41 with genetic simulations that take into account the demographic history of the European population, and to extend the selection test to all the allelic variants that previous studies have linked to LQTS.
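To make one of the two SVM input features concrete, the sketch below computes Tajima's D from a 0/1 haplotype matrix using the standard formula, which compares the mean number of pairwise differences with the number of segregating sites. The function name, the choice to return None when the statistic is undefined, and the toy samples are assumptions for this example; nSL and the SVM itself are not reproduced here.

```python
import math

def tajimas_d(haplotypes):
    """Tajima's D for rows of 0/1 alleles (0 = ancestral, 1 = derived).

    Returns None when the statistic is undefined (fewer than four
    haplotypes or no segregating sites).
    """
    n = len(haplotypes)
    if n < 4:
        return None
    derived_counts = [sum(column) for column in zip(*haplotypes)]
    segregating = [k for k in derived_counts if 0 < k < n]
    s = len(segregating)
    if s == 0:
        return None
    # Mean number of pairwise differences, summed over sites.
    pi = sum(2.0 * k * (n - k) / (n * (n - 1)) for k in segregating)
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    return (pi - s / a1) / math.sqrt(e1 * s + e2 * s * (s - 1))

# Toy samples (hypothetical data).
rare_variants = [(0, 0, 1), (0, 1, 0), (1, 0, 0), (0, 0, 0)]       # only singletons
intermediate = [(1, 1, 1), (1, 1, 1), (0, 0, 0), (0, 0, 0)]        # frequency 50% at every site

print(tajimas_d(rare_variants))  # negative: excess of rare variants
print(tajimas_d(intermediate))   # positive: excess of intermediate-frequency variants
```

An excess of rare variants, and hence a negative D, is the pattern classically expected after a selective sweep, although demographic events such as population expansion can produce it as well.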
To conclude, CNN41 is a valid and innovative tool for detecting signals of natural selection in the human genome, and it contributes to the identification of the genes and allelic variants that might be causal for complex genetic diseases. It will be possible to apply this tool to genome-wide analyses or to specific genetic regions of clinical interest. This approach has the potential to shed light on the epidemiology and pathophysiology of genetic diseases and could make it possible to develop preventive genetic screening tests and personalized-medicine drugs.
Genetic diseases are pathological conditions in which a modification of the nucleotide sequence of DNA, commonly also called the genome, is the predominant cause or a necessary contributing cause. They can be divided into monogenic (Mendelian) and polygenic-multifactorial (complex) diseases, depending on whether the mutation affects a single gene or several genes at the same time. The study of the genetic bases of diseases has gained great momentum in recent times, as the development of new genetic sequencing technologies has made it possible to determine the exact nucleotide sequence and to describe single nucleotide polymorphisms. Each position of the genome can show two or more alternative forms (alleles) that control the same trait and can encode qualitatively and/or quantitatively different products. Positions that vary within the population are called single nucleotide polymorphisms (SNPs). In diploid organisms such as humans, a single trait is described by the combination of two alleles (the genotype): if they are identical we speak of homozygosity, and if they are different of heterozygosity. By comparing each position of the human genome with that of the chimpanzee (the species closest to humans), we can define as ancestral alleles the nucleotides found at the orthologous positions of the chimpanzee genome and as derived alleles all the others. Individual genetic makeup is one of the strongest risk factors for a large number of complex diseases such as cancer, diabetes, and cardiovascular, autoimmune and psychiatric diseases. The influence of multiple allelic variants, to which the interaction between the individual and the environment must be added, makes their description very complicated.
However, understanding the genetic bases of these diseases is of fundamental clinical importance for designing increasingly targeted and effective prevention and treatment strategies. To analyse the allelic architecture of a genetic disease, it is possible to conduct an association study, that is, a genome-wide search for correlation signals between a phenotype and allelic variants, or to examine the genetic differences between two large groups of healthy and affected individuals. Genetic studies of this kind commonly identify a large number of allelic variants that are only weakly associated with a given phenotypic trait, and this hinders the understanding of which of them are truly determinant for the disease, information that is indispensable in a clinical context. In these circumstances, a broader and deeper analysis is needed that can integrate and combine the results of studies of different kinds. In this respect, classical genetic approaches have recently been complemented by population genetic studies grounded in evolutionary principles. The fate of the allelic variants linked to phenotypes is determined by the pressures exerted by natural selection and by the demographic history of populations, including changes in both their size and their geographic structure. Natural selection acts at the level of genotype transmission and refers to any non-random, differential propagation of an allelic variant in relation to its phenotypic effect. The selective action is directional, with the propagation of alleles regarded as beneficial (positive selection) and the elimination of deleterious ones (negative selection). From this perspective, the variants underlying genetic diseases appear to be an ideal substrate for natural selection, and various theoretical models have been proposed to describe their architecture in an evolutionary context.
Inferences of natural selection can provide important information, since genes and allelic variants that have been subject to selective pressures have a higher probability of being functional and therefore of being linked to the disease. Population genetic studies can guide association studies towards genomic regions that have been targets of selection and, conversely, variants weakly associated with the disease can be tested for selection, in order to provide a broad and comprehensive analysis of the genetic bases of complex diseases. In this thesis we propose the study of long QT syndrome (LQTS), a complex cardiovascular disease that manifests as a lengthening of the QT interval of the electrocardiogram. The QT interval is measured on the ECG trace from the beginning of the QRS complex to the end of the T wave and corresponds to the time of depolarization and repolarization of the myocardial tissue. When it is prolonged, the probability of onset of tachycardia and ventricular fibrillation is higher. These cardiac anomalies are a strong risk factor for sudden cardiac death (SCD), an event that ranks among the major causes of mortality in developed countries and that occurs in individuals who, in most cases, were not aware of being at risk. Identifying the genetic variants that determine the length of the QT interval could have a strong clinical impact by allowing the development of preventive genetic screening tests. Recent association studies of the long QT phenotype have identified a wide spectrum of allelic variants belonging to different genes (NOS1AP, KCNQ1, KCNH2, SCN5A, KCNJ2 and others), but it is not yet clear which alleles are causal and determinant for this trait.
An analysis of the problem from an evolutionary perspective led to the hypothesis that some ancestral allelic variants of the NOS1AP gene, associated with shorter QT intervals, show signals of positive selection in individuals of European origin. New and more detailed evolutionary inferences can narrow the investigation of the genetic bases of long QT syndrome to those mutations that, having been targets of natural selection, have a high probability of being functional. Positive selection on an evolutionarily beneficial allele increases its frequency within the population until it reaches high prevalence and sometimes fixation (a frequency of 100%). The effect also extends to the surrounding genetic region, and the variants that are linked to the selected one (linkage disequilibrium) increase in frequency. The trace left by positive selection on a genomic region is called a selective sweep and is a reduction of the haplotype diversity of the population around the causal variant, where a haplotype is the sequence of alleles that are in linkage with one another. The search for this pattern has historically been based on computing summary statistics on genetic data and on comparing the observed values with the expectations under the null hypothesis of neutrality. Each statistic captures single aspects of the signal, and its detection power decreases drastically in the presence of confounding effects caused by the demographic history of the population and by the characteristics of the genetic region under examination.
State-of-the-art methods for detecting selective patterns in the genome apply machine learning algorithms to summary statistics, the classical substrate of population genetics, and primarily classify the genetic region under examination as neutral or as subject to one of several modes of positive selection (soft sweep and hard sweep). This thesis involved the implementation of a convolutional neural network (CNN41) that estimates the intensity of the positive selective pressure that acted on a region of the human genome. With respect to the state of the art in population genetics, it is an innovative research method from both a conceptual and a methodological point of view. The founding idea of the project is to abandon the computation of summary statistics in favour of a direct translation of genetic data into images, so that all the information they contain can be used for prediction. Convolutional neural networks (CNNs) are today the most advanced machine learning tool for object detection, image classification and image segmentation. They belong to the category of deep artificial neural networks and can be visualized as three-dimensional volumes of neuronal units whose connectivity pattern is inspired by the animal visual cortex. The neural connections are adjusted during the training phase so that the network learns to recognise informative patterns in the input data that allow it to produce the desired output. In a classification task, the input image is first represented numerically as a matrix whose elements describe the colour of each pixel.
As it then passes through the layers that make up the network, the image undergoes transformations that end with the prediction of its class. A convolutional neural network is typically made of one or more convolutional layers in which a series of small weight matrices, called filters, examine narrow portions of the input image, just as each neuron of the visual cortex focuses on its own receptive field. The convolution of each filter with the image can be interpreted as the search for a certain feature and produces a neural activation map that represents the filtered input image. Each convolutional layer is followed by a pooling layer whose task is to reduce the dimensionality of the activation maps it receives while preserving the potentially important information extracted by the previous layers. Finally, the matrices are flattened and passed to one or more fully connected layers whose structure mirrors that of a classical artificial neural network (ANN). This last portion of the CNN produces a high-level synthesis of the information, and the final prediction is presented by the neurons of the last layer, whose number corresponds to the output classes defined in the classification problem at hand. Our input to the convolutional neural network consists of aligned genetic sequences translated into images in which white and black pixels encode ancestral and derived alleles, respectively. The sequences of the individuals of the population under examination are arranged on the rows of the image, and the columns therefore correspond to the genetic positions of the alignment. To highlight the pattern of a selective sweep, the images are processed by ordering the haplotypes (rows) by similarity and occurrence.
The convolutional network implemented in this thesis performs a classification task with 41 output classes that correspond to the discretization of a certain range of the selection coefficient, a continuous parameter that is difficult to predict and that is an index of the intensity of the selective pressure. The probabilities that an image belongs to each of the 41 output classes can be interpreted as the posterior probability distribution of the selection coefficient for the genetic region under examination. The translation of genetic data into images and the estimation of a continuous parameter by solving a classification problem with a high number of output classes constitute the originality of this thesis. Convolutional neural networks are supervised machine learning algorithms and, as such, must be trained on a dataset of images (training set) for which the class of each image is provided before they are able to make predictions on the images of interest (test set). The network is trained by adjusting the parameters (called weights and biases) of each artificial neuron so that the model as a whole maximizes the accuracy of its predictions. The algorithm used is error back-propagation, which consists of presenting a certain number of training images to the network and computing the gradient of the resulting prediction error; this is followed by a backward propagation phase in which each neuron of the inner layers is assigned its own contribution to the overall error, computed as a linear combination of the weights of the neuron and the errors of all the neurons in the following layers. Finally, the parameters of each neuron are updated with a stochastic gradient descent algorithm.
This process is repeated until all the images that make up the training set have passed through the network, which concludes one training epoch. Training proceeds epoch after epoch until a stopping condition is reached, typically a predefined number of epochs or an improvement in performance that falls below a certain threshold. Given the high complexity of classifying a genetic region into 41 classes of the selection coefficient, the training set of CNN41 consists of more than 100,000 images of data obtained with software for extensive genetic simulations (MSMS). The first experimental part was devoted to exploring the various ways of translating aligned genetic data into images, considering single-population or multi-population data and using colour encodings both in black and white and in four-channel "CMYK", with one channel for each DNA base (adenine, cytosine, guanine and thymine). The choice of black-and-white single-population images (ancestral/derived alleles) started a preliminary phase of experimentation with convolutional neural networks, in which networks were implemented for the binary classification of a genomic region as neutral or under positive selection. Experiments were carried out to optimize the image processing, considering the ordering of rows and/or columns by similarity and occurrence. The predictive abilities of the CNN were compared with those of a classical analysis method that bases its inferences on summary statistics. To this end, a support vector machine (SVM) was implemented, a supervised machine learning algorithm for binary classification, regression and outlier detection.
Each example of the training set is a replicate of the genomic simulations, which the SVM represents in the plane as a point whose coordinates are the corresponding statistics, Tajima’s D and nSL. Training ends with the definition of the best separating surface between the points that belong to the two classes of the problem, neutrality and positive selection. The predictive abilities of the CNN and the SVM were also compared in scenarios in which the demographic model of the population under examination is incorrect or misspecified. The excellent results obtained in this first experimental phase encouraged the implementation of the convolutional neural network CNN41, with 41 output classes, and its application to the search for selective patterns around allelic variants associated with long QT syndrome. The tests were carried out on the promoter region of the NOS1AP gene, given the existing hypotheses of selection and the strong signal of association with the lengthening of the QT interval. The selection coefficient estimated for the allelic variant rs10918594 is almost three times higher than the one estimated for rs12143842; although the latter variant is the one most strongly associated with the phenotype, in the light of these inferences of selection we propose rs10918594 as the most probable causal variant for long QT syndrome. However, we believe that, to make inferences of this importance on real genetic data, it is necessary to refine the training of CNN41 with genetic simulations that take into account the demographic history of the European population, and to extend the selection test to all the allelic variants that previous studies have associated with long QT syndrome. In conclusion, CNN41 is a valid and innovative tool for detecting signals of natural selection in the human genome, and it contributes to the identification of the genes and allelic variants that are causal for complex genetic diseases.
Sarà possibile applicare questo strumento su larga scala in analisi genome-wide o su regioni genetiche specifiche di interesse clinico. Questo tipo di investigazione potrebbe fornire nuove conoscenze riguardo l’epidemiologia e la patofisiologia delle malattie genetiche, nonché la possibilità di progettare dei test genetici di screening preventivo e farmaci di medicina personalizzata.
Reti neurali convoluzionali per l'identificazione di segnali di selezione naturale e mutazioni funzionali nel genoma umano
LORENZON, LUCREZIA
2017/2018
Abstract
Genetic diseases are pathological conditions for which a modification of the nucleotide sequence of DNA is the predominant or a necessary cause. We can distinguish between monogenic (Mendelian) and polygenic-multifactorial (complex) genetic diseases, depending on whether the mutation affects a single gene or more than one gene at the same time. The study of the genetic bases of diseases has received great interest in recent times, as the development of new DNA/RNA sequencing technologies has led to the determination of the exact nucleotide sequence of the genome and to the description of single nucleotide polymorphisms. Each genomic position can show two or more alternative forms, called alleles, that control the same trait and can encode qualitatively and/or quantitatively different products. Single nucleotide polymorphisms (SNPs) are positions that are variable within a population. In diploid species such as humans, a single position is described by the combination of two alleles (the genotype). Homozygous genotypes contain two copies of the same nucleotide, while heterozygous genotypes contain two different nucleotides. By comparing each position of the human genome with the corresponding position in the chimpanzee genome (the species closest to humans), we can define as ancestral the alleles found at the orthologous positions of the chimpanzee genome, and as derived all the others. Individual genetic makeup is one of the strongest risk factors for a considerable number of complex diseases, such as cancer, diabetes, and cardiovascular, autoimmune and psychiatric disorders. The influence of multiple allelic variants, together with the interaction between the individual and the environment, makes these diseases very complicated to describe. However, understanding the genetic bases of diseases is of fundamental clinical importance for planning effective prevention and treatment strategies. 
To analyse the allelic architecture of a genetic disease it is possible to conduct an association study, namely a genome-wide scan that looks for correlation signals between a phenotype and allelic variants, or to examine the genetic differences between two large groups of healthy and affected individuals. Such studies commonly identify a large number of allelic variants that are only weakly associated with a phenotypic trait, which makes it difficult to understand which ones are truly determinant for the disease. These circumstances call for a broader and deeper analysis that can integrate and combine the results of studies that differ in nature. To this end, classical genetic approaches have recently been complemented by population genetic studies based on evolutionary principles. The variation of allele frequencies over time is driven by the pressure imposed by both natural selection and population history, including changes in population size and migratory events. Natural selection refers to any non-random, differential propagation of an allelic variant in relation to its phenotypic effect. The selective action is directional, with the propagation of beneficial alleles (positive selection) and the elimination of deleterious ones (negative selection). From this perspective, allelic variants that are causal for genetic diseases appear to be a substrate for natural selection, and various theoretical models have been proposed to describe their genetic architecture in an evolutionary context. Inferences of natural selection can add important information, since allelic variants and genes that have been under selective pressure have a higher probability of being functional and therefore of being linked to susceptibility to certain diseases. 
Population genetic studies can guide association studies towards genomic regions that have been the target of selection and, conversely, variants weakly associated with the disease can be tested for selection, with the aim of providing a broad and comprehensive analysis of the genetic basis of complex diseases. In this thesis work, we study the long QT syndrome (LQTS), a complex cardiovascular disease that manifests as a lengthening of the QT interval of the electrocardiogram. The QT interval is measured on the ECG tracing from the beginning of the QRS complex to the end of the T wave, and it corresponds to the depolarization and repolarization of myocardial tissue. When the QT interval is longer than normal, the probabilities of tachycardia and ventricular fibrillation increase. These cardiac anomalies are a strong risk factor for sudden cardiac death (SCD), an event that is among the major causes of mortality in developed countries and that most often strikes individuals who were not aware of being predisposed to it. Identifying the genetic variants that determine the length of the QT interval can have a deep clinical impact, since it would allow the development of genetic screening tests. Recent association studies on the long QT phenotype have identified a wide spectrum of allelic variants in different genes (NOS1AP, KCNQ1, KCNH2, SCN5A, KCNJ2 among others), but it is not yet clear which of them are causal and determinant for this trait. The analysis of the problem from an evolutionary perspective gave rise to the hypothesis that some ancestral allelic variants of the NOS1AP gene, associated with a shorter QT interval, show signals of positive selection in individuals of European origin. New and more detailed evolutionary inferences can focus the investigation of the genetic bases of LQTS on the mutations that have been a target of natural selection and that therefore have a higher probability of being functional. 
The positive selection of an evolutionarily beneficial allele causes its frequency in the population to increase until it reaches high prevalence or even fixation (a frequency of 100%). The effect also extends to the surrounding genomic region: all the variants that are linked to the selected one (linkage disequilibrium) rise in frequency with it. The signature left by positive selection on a genomic region is called a selective sweep and consists of a reduction of the haplotype diversity of the population around the causal variant. A haplotype is defined as a sequence of alleles that are in linkage disequilibrium. The search for this pattern in the genome has historically been based on the calculation of summary statistics on genetic data and on the comparison between the observed data and the expectations under the null hypothesis of neutrality. Each statistic focuses on a single aspect of the signal, and its detection power decreases drastically in the presence of confounding effects caused by the demographic history of the population and by the characteristics of the genetic region examined. The most advanced methods for detecting selective patterns in the genome rely on the application of machine learning algorithms to population genetic summary statistics. In addition, they mainly classify a genetic region as neutral or as subject to different modes of positive selection (soft sweep and hard sweep). In this thesis, we implemented a convolutional neural network (CNN41) that estimates the strength of the positive selective pressure that acted on a region of the human genome. With respect to the state of the art in population genetics, it is an innovative research method from both a conceptual and a methodological point of view. 
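The reduction of haplotype diversity that characterises a selective sweep can be illustrated with a small numerical sketch. The example below is not taken from the thesis: it assumes haplotypes coded as tuples of 0 (ancestral) and 1 (derived) and uses the standard expected-heterozygosity estimator of haplotype diversity, with two hypothetical toy samples.

```python
from collections import Counter


def haplotype_diversity(haplotypes):
    """Haplotype diversity: n / (n - 1) * (1 - sum of squared haplotype frequencies)."""
    n = len(haplotypes)
    if n < 2:
        raise ValueError("at least two haplotypes are required")
    counts = Counter(tuple(h) for h in haplotypes)
    homozygosity = sum((c / n) ** 2 for c in counts.values())
    return n / (n - 1) * (1.0 - homozygosity)


# Hypothetical sample with four distinct haplotypes (high diversity).
neutral_like = [(0, 1, 0, 1), (1, 0, 0, 1), (0, 0, 1, 0), (1, 1, 0, 0)]

# Hypothetical sample in which one haplotype has risen to high frequency,
# as would be expected around a positively selected variant.
sweep_like = [(1, 1, 0, 1), (1, 1, 0, 1), (1, 1, 0, 1), (0, 1, 0, 1)]

h_neutral = haplotype_diversity(neutral_like)
h_sweep = haplotype_diversity(sweep_like)
print(h_neutral, h_sweep)
```

In this toy case the sweep-like sample has half the haplotype diversity of the neutral-like one, which is the kind of signal that a single summary statistic captures only in part.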
The basic idea of the project is to abandon summary statistics in favour of a direct translation of genetic data into images, so that all the information they contain can be exploited to make predictions. Convolutional neural networks (CNNs) are today the most advanced machine learning tool for object detection, image classification and segmentation. They belong to the category of deep neural networks and can be visualised as three-dimensional volumes of artificial neuronal units whose connectivity pattern is inspired by the animal visual cortex. The neural connections are tuned in the training phase of the network so that it learns to recognise the informative patterns in the input data that are linked to the desired output. In the case of classification, the input image is first represented numerically as a matrix in which each element describes the colour of one pixel. As the image passes through the network layers, it undergoes sequential transformations that end with the prediction of the output class. A convolutional neural network is typically made of one or more convolutional layers in which a series of small matrices of weights, called filters, analyse narrow portions of the input image, just as each biological neuron of the visual cortex focuses on its receptive field. The convolution of each filter with the image can be interpreted as the search for a certain feature, and it results in a neural activation map that represents the filtered input image. Each convolutional layer is followed by a pooling layer that decreases the dimensionality of the activation maps while keeping the potentially important discriminatory information obtained by the previous layers. Finally, the matrices are flattened and fed to one or more fully connected layers that have the same architecture as a classical artificial neural network (ANN). This last part of the CNN produces a high-level synthesis of the information and outputs the final prediction. 
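The sequence of operations described above (filtering, pooling, flattening) can be sketched in a few lines of plain Python. This is only an illustrative toy, not the architecture of CNN41: the 5x5 image, the 2x2 filter and the ReLU activation are assumptions chosen for the example.

```python
def cross_correlate_valid(image, kernel):
    """Slide the filter over the image (no padding, stride 1).

    Deep learning libraries usually call this operation "convolution",
    although strictly speaking the filter is not flipped.
    """
    k_rows, k_cols = len(kernel), len(kernel[0])
    out_rows = len(image) - k_rows + 1
    out_cols = len(image[0]) - k_cols + 1
    return [
        [
            sum(
                image[i + a][j + b] * kernel[a][b]
                for a in range(k_rows)
                for b in range(k_cols)
            )
            for j in range(out_cols)
        ]
        for i in range(out_rows)
    ]


def relu(matrix):
    """Assumed activation function: negative responses are set to zero."""
    return [[max(0, value) for value in row] for row in matrix]


def max_pool(matrix, size=2):
    """Keep the maximum of each non-overlapping size x size window."""
    return [
        [
            max(matrix[i + a][j + b] for a in range(size) for b in range(size))
            for j in range(0, len(matrix[0]) - size + 1, size)
        ]
        for i in range(0, len(matrix) - size + 1, size)
    ]


def flatten(matrix):
    """Turn the pooled map into the vector fed to the fully connected layers."""
    return [value for row in matrix for value in row]


# Toy image with a vertical edge between the second and third column.
image = [[0, 0, 1, 1, 1] for _ in range(5)]
# Toy filter that responds to a 0 -> 1 transition from left to right.
edge_filter = [[-1, 1], [-1, 1]]

activation_map = relu(cross_correlate_valid(image, edge_filter))
pooled = max_pool(activation_map)
features = flatten(pooled)
print(activation_map, pooled, features)
```

The activation map is non-zero only where the filter overlaps the edge, and pooling halves each dimension while preserving that response.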
The number of neurons in the last fully connected layer equals the number of classes defined in the specific classification problem. Our input to the convolutional neural network consists of aligned genetic sequences translated into images in which black and white pixels encode derived and ancestral alleles, respectively. The genetic sequences of the individuals of the examined population are arranged on the rows of the image, while the columns are the genetic positions of the alignment. To highlight the pattern of a selective sweep, the images are processed by ordering the haplotypes (rows) by similarity and occurrence. The convolutional neural network implemented in this thesis work carries out a classification task with 41 output classes that correspond to the discretization of a given range of the selection coefficient, a continuous parameter that is difficult to estimate and that measures the intensity of the selective pressure. The 41 output probabilities can be interpreted as the posterior probability distribution of the selection coefficient for the analysed genetic region. The originality of this thesis work lies in the translation of genetic data into images and in the estimation of a continuous parameter through the solution of a classification problem with a high number of output classes. Convolutional neural networks are supervised learning algorithms, so they need to be trained on a dataset of images for which the class of each image is known (labelled training set) before they can make predictions on new images of interest (unlabelled test set). The training of the network is accomplished by adjusting the internal parameters (called weights and biases) of each neuron so that the model as a whole maximises the prediction accuracy. 
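The image encoding and the interpretation of the 41 outputs described above can be sketched as follows. This is a hedged illustration rather than the thesis code: the 0/1 haplotype coding, the ordering rule (most frequent haplotypes first, which covers only the "occurrence" part of the ordering), and the range 0 to 400 for the selection coefficient are all assumptions made for the example, since the actual range is not stated here.

```python
from collections import Counter

BLACK, WHITE = 0, 255  # 8-bit greyscale values


def sort_rows_by_occurrence(haplotypes):
    """Order haplotypes so that the most frequent ones come first (assumed rule)."""
    counts = Counter(tuple(h) for h in haplotypes)
    return sorted((tuple(h) for h in haplotypes), key=lambda h: (-counts[h], h))


def alignment_to_image(haplotypes):
    """Rows are haplotypes, columns are positions; derived = black, ancestral = white."""
    return [[BLACK if allele == 1 else WHITE for allele in hap] for hap in haplotypes]


def selection_classes(s_min, s_max, n_classes=41):
    """Evenly spaced class values over an assumed range of the selection coefficient."""
    step = (s_max - s_min) / (n_classes - 1)
    return [s_min + k * step for k in range(n_classes)]


def posterior_mean(probabilities, class_values):
    """Point estimate obtained by averaging class values with the output probabilities."""
    return sum(p * v for p, v in zip(probabilities, class_values))


# Hypothetical alignment: 5 haplotypes, 3 positions (0 = ancestral, 1 = derived).
haplotypes = [(0, 1, 0), (1, 1, 0), (0, 1, 0), (1, 1, 0), (1, 1, 0)]
ordered = sort_rows_by_occurrence(haplotypes)
image = alignment_to_image(ordered)

# Hypothetical network output: probability mass split between two classes.
class_values = selection_classes(0, 400)  # assumed range, step of 10
probabilities = [0.0] * 41
probabilities[10] = 0.5
probabilities[30] = 0.5
estimate = posterior_mean(probabilities, class_values)
print(image, estimate)
```

Treating the output vector as a distribution also makes it possible to report the most probable class or a credible interval, not only the mean.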
This tuning is performed with the back-propagation algorithm, which feeds a small number of images of the training set through the network and then estimates the error gradient on the output vectors produced for these examples. The error gradient is then propagated backwards through the network, and the contribution of a given hidden neuron to the error is proportional to the linear combination of its weight vector and the errors associated with the neurons in the next layer. The parameters of each neuron are then updated with a stochastic gradient descent algorithm. This process is repeated until each training example has been fed through the network, which marks the completion of a single training epoch. Training continues until a specified stopping criterion is reached, typically a predefined number of epochs or an improvement of the network performance that falls below a certain threshold. Given the high complexity of classifying a genetic region among 41 classes of the selection coefficient, the training set of CNN41 consists of more than 100,000 images of genetic data obtained with a software for extensive genetic simulations (MSMS). The first experimental part focused on exploring various ways of translating aligned genetic sequences into images, considering single-population or multi-population data and using different colour codes: black and white, and four-channel "CMYK" with one channel for each DNA base (Adenine, Cytosine, Guanine and Thymine). The selection of black-and-white (derived/ancestral) single-population images was followed by a preliminary experimental phase with convolutional neural networks, in which we implemented networks for the binary classification of a genomic region between neutrality and positive selection. We carried out experiments to optimise the image processing, considering orderings of rows and/or columns by similarity and occurrence. 
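The training loop described above (mini-batches, parameter updates, epochs and a stopping criterion) can be sketched on a deliberately simple model. The example below is an assumption-laden toy and not the training code of CNN41: it replaces the network with a one-variable linear model, so the gradient is written by hand instead of being obtained by back-propagation, and the data, learning rate and thresholds are hypothetical.

```python
import random


def mse(w, b, data):
    """Mean squared error of the linear model y = w * x + b."""
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)


def train(data, batch_size=4, learning_rate=0.1, max_epochs=1000,
          min_improvement=1e-12, seed=0):
    """Mini-batch stochastic gradient descent with two stopping criteria."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    previous_loss = mse(w, b, data)
    epochs_run = 0
    for _ in range(max_epochs):  # criterion 1: predefined number of epochs
        examples = list(data)
        rng.shuffle(examples)
        for start in range(0, len(examples), batch_size):
            batch = examples[start:start + batch_size]
            # Gradient of the mean squared error on this mini-batch.
            grad_w = sum(2 * (w * x + b - y) * x for x, y in batch) / len(batch)
            grad_b = sum(2 * (w * x + b - y) for x, y in batch) / len(batch)
            w -= learning_rate * grad_w
            b -= learning_rate * grad_b
        epochs_run += 1  # every example has been seen once: one epoch
        loss = mse(w, b, data)
        if previous_loss - loss < min_improvement:
            break  # criterion 2: improvement below the threshold
        previous_loss = loss
    return w, b, epochs_run


# Hypothetical noiseless data generated from y = 2x + 1.
data = [(i / 10, 2 * (i / 10) + 1) for i in range(-10, 11)]
w, b, epochs_run = train(data)
print(w, b, epochs_run)
```

With a real network the same loop structure applies, but the gradients of all layers are computed by back-propagation and the stopping rule is usually evaluated on held-out validation data.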
The predictive abilities of the CNN were compared with those of a classical analysis method that makes inferences using summary statistics. To this end, we implemented a Support Vector Machine (SVM), a supervised machine learning algorithm for binary classification, regression and outlier detection problems. Each example of the training set is a replicate of the genetic simulations, which the SVM represents as a point in the plane whose coordinates are the corresponding summary statistics, Tajima’s D and nSL. The training ends when the best separating surface is found between the points that belong to the two classes of the problem, neutrality and positive selection. We also compared the predictive abilities of the CNN and the SVM when the demographic model of the population is incorrect or misspecified. The encouraging results obtained in this first experimental phase were followed by the implementation of a convolutional neural network with 41 output classes (CNN41) and by its application to the search for selective patterns around the allelic variants associated with the long QT syndrome. We tested for selection the promoter region of the NOS1AP gene, given the previously suggested hypothesis of positive selection and the strong signal of association with the lengthening of the QT interval. The selection coefficient estimated for the allelic variant rs10918594 is almost three times higher than that estimated for rs12143842; although rs12143842 is the variant most strongly associated with the phenotype, in light of these inferences of positive selection we propose rs10918594 as the most probable causal variant for LQTS. However, we believe that reaching such important conclusions on real genetic data requires refining the training of CNN41 with genetic simulations that take into account the demographic history of the European population, and extending the selection test to all the allelic variants that previous studies have linked to LQTS. 
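One of the two summary statistics used as SVM coordinates above, Tajima's D, can be computed directly from a binary haplotype matrix. The sketch below follows the standard published formula and is not the thesis implementation; the two toy samples are hypothetical, and nSL is not covered.

```python
import math


def tajimas_d(haplotypes):
    """Tajima's D for haplotypes coded as 0 (ancestral) / 1 (derived)."""
    n = len(haplotypes)
    if n < 4:
        raise ValueError("at least four haplotypes are required")
    derived_counts = [sum(column) for column in zip(*haplotypes)]
    segregating = [k for k in derived_counts if 0 < k < n]
    s = len(segregating)
    if s == 0:
        raise ValueError("no segregating sites")
    # Mean number of pairwise differences between haplotypes.
    pi = sum(2 * k * (n - k) for k in segregating) / (n * (n - 1))
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    return (pi - s / a1) / math.sqrt(e1 * s + e2 * s * (s - 1))


# Hypothetical sample in which every variant is carried by a single haplotype.
rare_variants = [
    (1, 0, 0, 0),
    (0, 1, 0, 0),
    (0, 0, 1, 0),
    (0, 0, 0, 1),
    (0, 0, 0, 0),
]

# Hypothetical sample in which every variant is at intermediate frequency.
intermediate_variants = [
    (1, 1, 0, 0),
    (1, 1, 0, 0),
    (0, 0, 1, 1),
    (0, 0, 1, 1),
    (0, 0, 1, 1),
]

d_rare = tajimas_d(rare_variants)
d_intermediate = tajimas_d(intermediate_variants)
print(d_rare, d_intermediate)
```

An excess of rare variants gives a negative value and an excess of intermediate-frequency variants a positive one; since demographic events such as population expansion can also produce negative values, the statistic alone is sensitive to the confounding effects mentioned earlier.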
To conclude, CNN41 is a valid and innovative tool for detecting signals of natural selection in the human genome, and it contributes to the identification of the genes and allelic variants that might be causal for complex genetic diseases. It will be possible to apply this tool to genome-wide analyses or to specific genetic regions of clinical interest. This approach has the potential to shed light on the epidemiology and pathophysiology of genetic diseases, and could make it possible to develop preventive genetic screening tests and personalised-medicine drugs.
File | Size | Format | |
---|---|---|---|
2018_7_Lorenzon.pdf (not accessible; description: thesis text) | 4.68 MB | Adobe PDF | |
Documents in POLITesi are protected by copyright and all rights are reserved, unless otherwise indicated.
https://hdl.handle.net/10589/141585