Computing mutational signature exposure in human cancer: bridging the gap from whole exome to whole genome sequencing

This project combines the fundamentals of genomic assignment of mutational signatures with Machine Learning (ML) models to derive rules that bridge the gap between the whole exome and whole genome settings, with the aim of making signatures efficient biomarkers. First, we analyzed a dataset of breast cancer mutations from 72 patients. We developed a preprocessing procedure based on tools such as VEP, SigProfilerMatrixGenerator and SigProfilerAssignment to prepare the data for analysis. Next, we implemented offline a minimization method, extendable to any tumor type, that assigns signatures to the individual samples. We then tried to adjust the whole exome mutation matrix with a simple corrective method in order to reduce its distance from the whole genome matrix. Finally, we turned to artificial neural networks (ANNs), with the goal of finding a model that derives whole genome level information from the more readily available whole exome data; whole genome assignments are in fact better suited for use as a tumor characterization tool. We first developed the networks using only the breast cancer dataset, then modeled five different tumor types as input, and finally trained the network on six datasets while evaluating on a seventh, with additional results obtained by working with all seven available datasets. In all three scenarios we used two different approaches. In the first, we sought a model that maps directly from the per-sample distribution of whole exome mutations to the assignment of signatures in the whole genome context. In the second, we trained the network using the exposures to signatures in the whole exome context as input and, as in the first approach, the exposures to signatures in the whole genome context as target.
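The signature assignment step mentioned above can be illustrated with a minimal sketch. It assumes the minimization is a non-negative least squares fit of a sample's mutation counts to a fixed set of reference signatures, solved here with multiplicative updates; the abstract does not specify the objective or solver actually used, so the function name `assign_exposures`, the toy four-channel signatures and the iteration count are all hypothetical choices made for illustration only.

```python
def assign_exposures(signatures, counts, n_iter=2000, eps=1e-12):
    """Estimate non-negative exposures e that minimize ||counts - S e||^2.

    signatures: list of K reference signatures, each a list of C channel
                probabilities (hypothetical toy input, not real COSMIC data).
    counts:     list of C observed mutation counts for one sample.
    """
    k = len(signatures)
    c = len(counts)
    # S^T m: projection of the observed counts onto each signature (constant).
    st_m = [sum(sig[j] * counts[j] for j in range(c)) for sig in signatures]
    # Gram matrix S^T S between signatures (constant).
    gram = [[sum(a[j] * b[j] for j in range(c)) for b in signatures]
            for a in signatures]
    # Strictly positive start: split the mutation burden evenly.
    exposures = [sum(counts) / k] * k
    for _ in range(n_iter):
        # Multiplicative update: e_i <- e_i * (S^T m)_i / (S^T S e)_i.
        # It keeps every exposure non-negative by construction.
        denom = [sum(gram[i][l] * exposures[l] for l in range(k))
                 for i in range(k)]
        exposures = [exposures[i] * st_m[i] / (denom[i] + eps)
                     for i in range(k)]
    return exposures


# Toy example: two signatures over four mutation channels.
sig_a = [0.7, 0.1, 0.1, 0.1]
sig_b = [0.1, 0.1, 0.2, 0.6]
# Counts generated from 30 mutations of sig_a plus 70 mutations of sig_b.
counts = [28.0, 10.0, 17.0, 45.0]

exposures = assign_exposures([sig_a, sig_b], counts)
print([round(e, 3) for e in exposures])
```

In this toy case the counts are an exact mixture of the two signatures, so the fit recovers exposures close to 30 and 70. With real whole exome data the counts are sparser and noisier, which is part of what motivates the corrective and neural network approaches described in the abstract.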
