Large-scale compression of genomic sequence databases with the Burrows-Wheeler transform

Bioinformatics
Anthony J Cox, Giovanna Rosone

Abstract

The Burrows-Wheeler transform (BWT) is the foundation of many algorithms for compression and indexing of text data, but the cost of computing the BWT of very large string collections has prevented these techniques from being widely applied to the large sets of sequences that DNA sequencing experiments typically produce. In previous work, we presented a novel algorithm that allows the BWT of human genome scale data to be computed on very moderate hardware, which enables us to investigate the BWT as a tool for the compression of such datasets. We first use simulated reads to explore how the level of compression depends on the error rate, the length of the reads and the level of sampling of the underlying genome, and we compare choices of second-stage compression algorithm. We demonstrate that compression may be greatly improved by a particular reordering of the sequences in the collection, and we give a novel 'implicit sorting' strategy that realizes these benefits without the overhead of sorting the reads. With these techniques, a 45× coverage of real human genome sequence data compresses losslessly to under 0.5 bits per base, allowing the 135.3 Gb of sequence to fit into only 8.2 GB of space (t...
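The headline figures can be checked with a few lines of arithmetic. The sketch below is not from the paper; it assumes that "Gb" means 10^9 bases and that "GB" means 10^9 bytes (with binary gigabytes the result would come out slightly above 0.5). The function name `bits_per_base` is hypothetical.

```python
# Figures quoted in the abstract.
BASES = 135.3e9            # 135.3 Gb of sequence, read as 135.3 * 10**9 bases
COMPRESSED_BYTES = 8.2e9   # 8.2 GB, ASSUMED to mean 10**9 bytes per GB


def bits_per_base(compressed_bytes: float, bases: float) -> float:
    """Average number of bits of compressed output spent on each base."""
    return compressed_bytes * 8 / bases


rate = bits_per_base(COMPRESSED_BYTES, BASES)
print(f"{rate:.3f} bits per base")  # prints "0.485 bits per base"
```

Under these assumptions the two quoted numbers agree with the "under 0.5 bits per base" claim.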


