Decision Trees and Random Forests are widely used machine learning models valued for their interpretability, transparency, and ability to handle complex data. These models are commonly applied across various fields, including bioinformatics, where their capability to manage high-dimensional, non-linear, and heterogeneous data makes them especially effective for analyzing genomic, transcriptomic, and clinical datasets. However, such models focus solely on the predictive ability of each feature and may overlook its biological significance. Incorporating feature-specific prior biological knowledge, such as functional annotations and pathway information, into tree-based predictive models can improve their outcomes by leveraging well-established biological knowledge. In this thesis, we propose a novel method for integrating prior information, obtained from popular biological knowledge bases, into tree-based models. The experimental findings, especially when compared with those of standard tree-based models, suggest that the implemented biology-informed models effectively identify the most relevant features, in terms of both predictive ability and biological relevance.
Integrating biological prior knowledge into tree-based machine learning models for comprehensive gene analysis
TACCA, MANUEL
2023/2024
File | Size | Format | Access |
---|---|---|---|
Article_Format_Thesis_Manuel_Tacca_10707009.pdf | 2.1 MB | Adobe PDF | Available online to authorized users only |
Executive_Summary_Manuel_Tacca_10707009.pdf | 751.38 kB | Adobe PDF | Available online to authorized users only |
Documents in POLITesi are protected by copyright, and all rights are reserved unless otherwise indicated.
https://hdl.handle.net/10589/231466