Optimized distributed systems achieve significant performance improvement on sorted merging of massive VCF files

GigaScience
Xiaobo Sun, CAAPA consortium

Abstract

Sorted merging of genomic data is a common operation in many sequencing-based studies. It involves sorting and merging genomic data from different subjects by genomic location. In particular, merging a large number of variant call format (VCF) files is frequently required in large-scale whole-genome sequencing or whole-exome sequencing projects. Traditional single-machine methods become increasingly inefficient as the number of files grows, because of excessive computation time and input/output (I/O) bottlenecks. Distributed systems, and more recently cloud-based systems, offer an attractive solution. However, carefully designed and optimized workflow patterns and execution plans (schemas) are required to take full advantage of the increased computing power while overcoming bottlenecks to achieve high performance. In this study, we custom-design optimized schemas for three Apache big data platforms, Hadoop (MapReduce), HBase, and Spark, to perform sorted merging of a large number of VCF files. These schemas all adopt the divide-and-conquer strategy, splitting the merging job into sequential phases/stages whose subtasks are processed in an ordered, parallel, and bottleneck-free way. In two il...
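To make the operation concrete, the following is a minimal single-machine sketch of sorted merging with a divide-and-conquer structure. It is not one of the Hadoop, HBase, or Spark schemas described in the abstract. The record layout `(chrom, pos, genotype)`, the function names, the chromosome ordering, and the use of `./.` for a sample with no call at a site are all assumptions made for illustration.

```python
import heapq
from collections import defaultdict
from itertools import groupby

# Assumed chromosome ordering (hypothetical; real VCF headers define their own contigs).
CHROM_ORDER = {c: i for i, c in enumerate([str(n) for n in range(1, 23)] + ["X", "Y"])}


def partition_by_chrom(sample_records):
    """Divide: split one sample's position-sorted records into per-chromosome lists."""
    parts = defaultdict(list)
    for chrom, pos, genotype in sample_records:
        parts[chrom].append((pos, genotype))
    return parts


def merge_partition(chrom, per_sample_lists, sample_names):
    """Conquer: k-way merge of one chromosome's records across all samples."""
    tagged = [
        [(pos, name, gt) for pos, gt in records]
        for name, records in zip(sample_names, per_sample_lists)
    ]
    merged = heapq.merge(*tagged)  # every input list must already be sorted by pos
    rows = []
    for pos, group in groupby(merged, key=lambda r: r[0]):
        calls = {name: gt for _, name, gt in group}
        # Samples with no record at this position get the assumed missing value "./.".
        rows.append((chrom, pos, [calls.get(n, "./.") for n in sample_names]))
    return rows


def sorted_merge(samples):
    """Merge {sample_name: [(chrom, pos, genotype), ...]} into location-sorted rows."""
    names = sorted(samples)
    parts = {n: partition_by_chrom(samples[n]) for n in names}
    chroms = sorted({c for p in parts.values() for c in p}, key=CHROM_ORDER.__getitem__)
    out = []
    for chrom in chroms:  # partitions are independent, so they could run in parallel
        out.extend(merge_partition(chrom, [parts[n].get(chrom, []) for n in names], names))
    return out
```

Because each chromosome partition is merged independently of the others, the per-partition step is the part a distributed platform could run in parallel; here the partitions are simply processed in a loop.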


