Compositions are a type of data in which each observation carries only relative information (expressed as proportions or concentrations) about the parts making up a whole, whose total (absolute) value is unknown. Compositional data are common in statistical genetics; however, existing methods often do not treat compositions properly and cannot provide results equivalent to those obtained when the data are available in absolute, non-compositional form. Here we present and assess a novel method for compositional data, specifically designed to be implemented as part of a Genome-Wide Association Study (GWAS) that investigates the genetic impact on the relative quantities of white blood cells (termed phenotypes in GWAS terminology). Our first contribution is the design of a particular set of compositional transformations, based on compositional pivot coordinates, which can be used as responses for a full and informative GWAS. Our second contribution is a classification framework (based on our designed coordinates) that produces a final verdict on which cell types would be found significantly associated with genetics if absolute data were available. The developed method, which we call the "Smart pivots method", not only acts as a correction for the use of compositions, but also aims to reduce the entanglement of information related to different compositional parts. Method performance is assessed both in extensive simulations and on real data provided by the UK BioBank, allowing for a full GWAS on a large human dataset for both compositional and non-compositional (benchmark) phenotypes. This thesis therefore extends the existing literature on both GWAS and compositional data analysis, while providing a solution to compositional issues that arise in different application fields.
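As an illustration of the pivot coordinates the abstract refers to, the sketch below computes the standard pivot (isometric log-ratio) coordinates from the compositional data analysis literature. It is not the thesis's "Smart pivots method", whose construction is not given on this page. The function names `closure` and `pivot_coordinates` and the five-part cell counts are hypothetical and chosen for illustration only.

```python
import numpy as np

def closure(x):
    """Rescale each row so its parts sum to 1 (proportions)."""
    x = np.asarray(x, dtype=float)
    return x / x.sum(axis=-1, keepdims=True)

def pivot_coordinates(x, pivot=0):
    """Standard pivot coordinates of strictly positive compositions.

    The part at index `pivot` is moved to the front, so that the first
    coordinate holds all the relative information about that part:
    z_i = sqrt((D-i)/(D-i+1)) * ln(x_i / gmean(x_{i+1}, ..., x_D)).
    """
    x = np.asarray(x, dtype=float)
    D = x.shape[-1]
    order = [pivot] + [j for j in range(D) if j != pivot]
    logx = np.log(x[..., order])
    z = np.empty(logx.shape[:-1] + (D - 1,))
    for i in range(D - 1):
        k = D - i - 1  # number of parts remaining after part i
        # log of x_i over the geometric mean of the remaining parts
        log_ratio = logx[..., i] - logx[..., i + 1:].mean(axis=-1)
        z[..., i] = np.sqrt(k / (k + 1)) * log_ratio
    return z

# Hypothetical absolute counts for five white blood cell types (made-up values).
counts = np.array([[4200.0, 2100.0, 500.0, 150.0, 50.0]])
z_abs = pivot_coordinates(counts)
z_rel = pivot_coordinates(closure(counts))
```

The coordinates computed from the counts and from their proportions coincide, because log-ratios ignore the unknown total; this is the sense in which a composition carries only relative information.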
Disentangling compositional data in genomics: a GWAS backbone to single out compositional phenotypes
Cantalini, Costanza
2022/2023
File | Access | Description | Size | Format
---|---|---|---|---
2024_04_Cantalini_ExecutiveSummary_02.pdf | Openly accessible online | Executive summary | 582.46 kB | Adobe PDF
2024_04_Cantalini_Tesi_01.pdf | Openly accessible online | Thesis text | 8.64 MB | Adobe PDF
Documents in POLITesi are protected by copyright and all rights are reserved, unless otherwise indicated.
https://hdl.handle.net/10589/217972