MSPC: an R/Bioconductor package for combined analysis of ChIP-Seq data
SHAYIDING, JULAITI
2015/2016
Abstract
Recent advances in Next Generation Sequencing (NGS) technologies have enabled genome-wide measurement of DNA-associated proteins through a combination of chromatin immunoprecipitation and sequencing (ChIP-Seq). With the decreasing cost of sequencing, ChIP-Seq has become widely used in genomic research to determine the binding sites of Transcription Factors (TFs) or the enrichment of histone modifications of interest. A precise map of the binding sites of TFs and other DNA-binding proteins is vital for deciphering the gene regulatory networks that underlie various biological processes. Identifying protein binding sites in large, sequence-based datasets, however, presents a bioinformatics challenge that has required novel, sophisticated computational methods and algorithms. Although numerous software tools have been introduced that attempt to reveal true binding events with higher precision, few algorithms have achieved the desired result, because NGS data are complex and can contain a large amount of artifacts (i.e., noise). A recently proposed computational method, multiple sample peak calling (MSPC), simultaneously evaluates multiple ChIP-Seq replicates for TF binding sites and rigorously combines the local evidence of Enriched Regions (ERs) using Fisher's method, which increases the statistical significance of the ERs detected in the enrichment analysis output. This thesis concerns the development of a novel R/Bioconductor package that aims to facilitate downstream analysis of NGS data. The package provides functions for data import, a quality assessment pipeline for genomic interval overlapping, post-processing of enrichment analysis, and data exploration. The implementation of the MSPC package takes advantage of parallel processing, jointly analyzes the ERs of multiple ChIP-Seq samples, and renders a newly discovered ER list that takes the combined local evidence of the ERs into account.
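The combination of local evidence with Fisher's method mentioned above can be illustrated with a minimal sketch. It is written in Python using only the standard library and is not taken from the MSPC package. The function names `chi2_sf_even_df` and `fisher_combine` are hypothetical, and the sketch assumes that the p-values of overlapping ERs from different replicates are independent and have already been collected into a list.

```python
import math


def chi2_sf_even_df(x, df):
    """Upper-tail probability of a chi-square variable with an even number
    of degrees of freedom, using the closed form
    exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!  with k = df / 2."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("df must be a positive even integer")
    half = x / 2.0
    term = 1.0   # (x/2)^0 / 0!
    total = 1.0
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total


def fisher_combine(p_values):
    """Combine independent p-values with Fisher's method.

    Hypothetical helper for illustration: returns the test statistic
    X = -2 * sum(ln p_i) and the combined p-value, which is the upper tail
    of a chi-square distribution with 2k degrees of freedom for k p-values.
    """
    if not p_values:
        raise ValueError("at least one p-value is required")
    if any(not (0.0 < p <= 1.0) for p in p_values):
        raise ValueError("p-values must lie in (0, 1]")
    statistic = -2.0 * sum(math.log(p) for p in p_values)
    return statistic, chi2_sf_even_df(statistic, 2 * len(p_values))


if __name__ == "__main__":
    # Assumed example: the same region has p = 0.01 in two replicates.
    statistic, combined = fisher_combine([0.01, 0.01])
    print(f"X = {statistic:.4f}, combined p-value = {combined:.6g}")
```

In this assumed example, two replicates that each report the region at p = 0.01 give a combined p-value well below 0.01, which is the sense in which overlapping evidence across replicates strengthens the significance of an ER.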
Finally, the MSPC package was tested on public Myc TF datasets for K562 human cells available from the ENCODE project. The results were validated against the verified software tool MuSERA under the same parameter settings, reaching an accuracy of 96%.

| File | Size | Format |
|---|---|---|
| 2017_04_Shayiding.pdf (Description: Thesis text; accessible online only to authorized users) | 1.62 MB | Adobe PDF |
Documents in POLITesi are protected by copyright and all rights are reserved, unless otherwise indicated.
https://hdl.handle.net/10589/132718