Viral genomics lies at the intersection of biological research and computer science and plays a pivotal role in understanding and predicting the behavior of viruses. It develops and applies computational methods to analyze vast datasets of viral genomes, with the overarching goal of unraveling complex genomic architectures, delineating evolutionary dynamics, and predicting virulence factors and potential therapeutic targets. The field faces significant challenges, particularly because of the inherent variability of viruses: they differ from one another in genome structure, size, and features. Viruses can also evolve and generate thousands of variants, which can lead to rapid dissemination and, potentially, to a leap to new host species. The exponential growth in data volume since the introduction of Next Generation Sequencing (NGS) technology in 2008, followed by Nanopore sequencing, has transformed the landscape of genomic data: in less than a decade, genomic data went from scarce and costly to relatively cheap and widely available. Despite its abundance, viral genomic data is dispersed across numerous laboratories and organizations, which often lack a common agreement on formats and standards.
This thesis addresses these complexities by developing a data integration pipeline that resulted in ViruSurf, an integrated repository of viral sequences and metadata. The repository provides a unified representation of viral data, independent of source or viral species. It also supports applications such as EpiSurf, an epitope repository and analysis tool, and VirusViz, a sequence analysis tool.
Soon after the start of the pandemic, numerous research studies on COVID-19 were released. However, the scattered nature of the information, expressed in natural language and at times duplicated or conflicting, impeded the practical application of their results. To help domain experts and health authorities organize and use this knowledge effectively, a knowledge model named CoV2K was introduced. CoV2K integrates diverse information fields and is backed by a knowledge database accessible through an API. What sets CoV2K apart is its ability to connect disparate knowledge domains, such as variants, epidemiological effects, and clinical effects, facilitating the discovery of relationships among data entities. The model improves the understanding of the complex nature of genomic data and supports the integration of data and knowledge across the scientific domains related to SARS-CoV-2. Building on these foundations, the thesis demonstrates the application of CoV2K to automated knowledge discovery in viral genomics. This approach aligns with artificial reasoning methodologies and fosters collaboration between the artificial intelligence and biological sciences communities. CoV2K allows the knowledge base to adapt and evolve dynamically, making it possible to extract meaningful insights from complex viral genomic data.
In the later stages of the pandemic, the emergence of recombinant variants of SARS-CoV-2 prompted a focused effort on the rapid detection of recombination events across sequences of single RNA viruses. The proposed method demonstrates higher diagnostic accuracy and detection speed than manual approaches and existing software, offering a valuable tool for public health responses to potential novel threats.
In conclusion, this thesis contributes to viral genomics by addressing challenges in data integration, knowledge modeling, automated reasoning, and the detection of viral recombination. The applications and methodologies presented provide valuable insights for public health preparedness against evolving viral threats. Future work could extend the novel recombination detection approach to other viruses, supporting continued advances in viral surveillance and research.
Viral genomics is a highly topical research field that applies computational and data management methods to the study of viruses, in order to understand and predict their behavior, as well as to hinder their spread or develop new treatments. In this field, the pursuit of understanding complex biological phenomena often requires managing and analyzing very large datasets, which frequently vary in data type, quality, completeness, and format. The abundance, variability, and diversity of genomic data therefore pose significant research challenges. Viral species differ in many respects, such as structure, functioning, genome characteristics, and hosts; viruses can also evolve, mutate, and sometimes spread to new species. Moreover, the introduction of new sequencing techniques, such as Next Generation Sequencing (NGS) and Nanopore sequencing, has led to exponential growth in data.
This thesis addresses these challenges by developing a data integration pipeline that led to the creation of an integrated repository of viral sequences and of several web apps for their analysis. The repository includes data from the most important international databases, provides a unified representation of them regardless of source or viral species, and enables complex analyses that are not directly possible in the source databases. The repository can be used through the web apps ViruSurf, EpiSurf, and VirusViz, which let the end user focus on the analysis of arbitrary sets of sequences, epitopes, and mutations, respectively.
In addition to viral sequences, the COVID-19 pandemic indirectly produced an abundance of studies and publications aimed at countering the disease and understanding the virus that causes it, SARS-CoV-2. This body of research is an extremely important asset. However, applying the knowledge conveyed by scientific articles is made difficult by the fragmentary nature of the information, which is often expressed in natural language and sometimes duplicated or conflicting. To support the work of the scientists and health authorities involved in studying the virus, a knowledge model named CoV2K was developed, capable of effectively organizing the information collected by the scientific community through institutional channels and scientific articles. It describes information about SARS-CoV-2 and links very different domains, such as variants, clinical effects, and epidemiological effects, facilitating the discovery of relationships among the entities it contains. CoV2K is not only a data model but also a knowledge base, accessible via API, aimed at improving the understanding of genomic data and supporting integration with knowledge from different scientific domains related to SARS-CoV-2. This thesis demonstrates a CoV2K-based approach to facilitating the discovery of new connections in the data. The approach draws on artificial reasoning techniques from artificial intelligence to generate logical deductions among CoV2K entities and to address a series of interesting case studies. Such a methodology makes it easier to identify relationships that are hard for humans to spot given the volume of information, and it facilitates hypothesis testing through data that can be extracted from CoV2K with a descriptive logic language.
The spread of recombinant SARS-CoV-2 variants has also highlighted the need to rapidly identify recombination events. This thesis describes an innovative, data-driven method for identifying these events with greater accuracy and speed than the state of the art.
In conclusion, this thesis makes a significant contribution to viral genomics. It describes notable approaches to data integration and knowledge modeling, develops logical reasoning methods in the context of viruses, and presents an innovative method for identifying recombinant sequences. The methods and applications presented provide valuable information for preventing new pandemics and improving their management. In the future, extending the new recombination identification method to further viral species could significantly improve the monitoring and prevention of epidemiological risk.
Methods and tools for data integration and knowledge discovery in viral genomics
Alfonsi, Tommaso
2023/2024
File | Size | Format | |
---|---|---|---|
2024_03_31_PhD_Thesis_Alfonsi_Tommaso.pdf (Alfonsi Tommaso PhD Thesis; accessible online only to authorized users) | 17.39 MB | Adobe PDF | |
Documents in POLITesi are protected by copyright and all rights are reserved, unless otherwise indicated.
https://hdl.handle.net/10589/220394