Huang, Liren: Cloud-based Bioinformatics Framework for Next-Generation Sequencing Data. 2019
Contents
Titlepage
Abstract
Acknowledgement
1 Introduction
1.1 The big data challenge in life science
1.2 Distributed cloud computing
1.3 Thesis structure
2 Related Work
2.1 The Apache Hadoop and Spark frameworks
2.1.1 Cluster topology
2.1.2 Spark data processing paradigm
2.1.3 Sorting in Spark
2.2 Sequence alignment and its cloud implementations
2.2.1 Short read alignment and fragment recruitment
2.2.2 Algorithms for sequence alignment
2.2.3 Distributed implementations
2.3 De novo assembly and its cloud implementations
2.3.1 Algorithms for short read de novo assembly
2.3.2 State-of-the-art de Bruijn graph
2.3.3 Cloud-based de novo assemblers
2.4 Conclusion
3 Sparkhit: Distributed sequence alignment
3.1 The pipeline for sequence alignment
3.1.1 Building reference index
3.1.2 Candidate block searching and q-Gram filters
3.1.3 Pigeonhole principle
3.1.4 Banded alignment
3.2 Distributed implementation
3.2.1 Reference index serialization and broadcasting
3.2.2 Data representation in the Spark RDD
3.2.3 Concurrent in-memory searching
3.2.4 Memory tuning for Spark native implementation
3.3 Using external tools and Docker containers
3.4 Integrating Spark's machine learning library (MLlib)
3.5 Parallel data preprocessing
3.6 Results and Discussion
3.6.1 Run time comparison between different mappers
3.6.2 Scaling performance of Sparkhit-recruiter
3.6.3 Accuracy and sensitivity of natively implemented tools
3.6.4 Fragment recruitment comparison with MetaSpark
3.6.5 Preprocessing comparison with Crossbow
3.6.6 Machine learning library benchmarking and run time performances on different clusters
3.6.7 Cluster configurations for the benchmarks
3.6.8 NGS data sets for the benchmarks
3.6.9 Discussion
4 Reflexiv: Parallel De Novo genome assembly
4.1 Reflexible Distributed K-mer (RDK)
4.2 Random k-mer reflecting and recursion
4.3 Distributed implementation
4.4 Repeat detection and bubble popping
4.5 The assembly pipeline
4.6 Time complexity
4.7 Memory consumption
4.8 Results and Discussion
4.8.1 Results
4.8.2 Discussion
5 Large scale genomic data analyses
5.1 Cluster deployment and configuration
5.2 Data storage and accessibility
5.3 Distributed data downloading and decompression
5.4 Rapid NGS data analyses on the AWS cloud
5.4.1 Processing all WGS data of the Human Microbiome Project
5.4.2 Genotyping on 3000 samples of the 3000 Rice Genomes Project
5.4.3 Mapping 106 samples of the 1000 Genomes Project
5.4.4 Gene expression profiling on prostate cancer RNA-seq data
5.5 Metagenomic profiling and functional analysis
5.6 Discussion
6 Conclusion and outlook
6.1 Conclusion
6.2 Outlook
Bibliography
Colophon
Declaration