Abstract
Sorted merging of genomic data is a common operation in many sequencing-based studies. It involves sorting and merging genomic data from different subjects by genomic location. In particular, merging a large number of variant call format (VCF) files is frequently required in large-scale whole-genome sequencing or whole-exome sequencing projects. Traditional single-machine-based methods become increasingly inefficient as the number of files grows, because of excessive computation time and input/output (I/O) bottlenecks. Distributed systems, and more recently cloud-based systems, offer an attractive solution. However, carefully designed and optimized workflow patterns and execution plans (schemas) are required to take full advantage of the increased computing power while overcoming bottlenecks to achieve high performance. In this study, we design optimized schemas for three Apache big data platforms, Hadoop (MapReduce), HBase, and Spark, to perform sorted merging of a large number of VCF files. These schemas all adopt a divide-and-conquer strategy that splits the merging job into sequential phases/stages consisting of subtasks, which are executed in an ordered, parallel, and bottleneck-free manner. In two il...
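To make the operation concrete, the following is a minimal single-machine sketch of sorted merging by genomic location. It is not one of the Hadoop, HBase, or Spark schemas described in the study. The record layout `(chrom, pos, sample, genotype)`, the `CHROM_ORDER` table, and the function names are assumptions made for illustration; real VCF records also carry REF/ALT alleles and other fields that a full merge would have to reconcile.

```python
import heapq
from itertools import groupby

# Hypothetical chromosome ordering used as the sort key; a real VCF header
# defines its own contig order.
CHROM_ORDER = {str(i): i for i in range(1, 23)}
CHROM_ORDER.update({"X": 23, "Y": 24, "MT": 25})


def sort_key(record):
    """Order records by (chromosome rank, position)."""
    chrom, pos, _sample, _genotype = record
    return (CHROM_ORDER[chrom], pos)


def sorted_merge(per_sample_records, samples, missing="./."):
    """Merge per-sample record streams that are each already sorted by location.

    Yields one row per genomic location: (chrom, pos, [genotype per sample]).
    Samples without a call at a location receive the `missing` placeholder.
    """
    # k-way merge of the pre-sorted streams, then group records sharing a location.
    merged = heapq.merge(*per_sample_records, key=sort_key)
    for _, group in groupby(merged, key=sort_key):
        group = list(group)
        chrom, pos = group[0][0], group[0][1]
        calls = {sample: genotype for _, _, sample, genotype in group}
        yield chrom, pos, [calls.get(s, missing) for s in samples]


# Toy input: two subjects with simplified, location-sorted records.
subject_a = [("1", 100, "A", "0/1"), ("1", 250, "A", "1/1"), ("2", 40, "A", "0/1")]
subject_b = [("1", 100, "B", "1/1"), ("2", 40, "B", "0/1"), ("X", 7, "B", "0/1")]

rows = list(sorted_merge([subject_a, subject_b], samples=["A", "B"]))
for row in rows:
    print(row)
```

Because the merge is keyed on location, the work can in principle be partitioned by genomic region and the partitions processed independently, which is the property a divide-and-conquer schema relies on.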
May 16, 2018·GigaScience·Xiaobo Sun, CAAPA consortium