DOI: 10.1101/472399Nov 19, 2018Paper

Prefix-Free Parsing for Building Big BWTs

BioRxiv : the Preprint Server for Biology
Christina BoucherTaher Mun

Abstract

High-throughput sequencing technologies have led to explosive growth of genomic databases; one of which will soon reach hundreds of terabytes. For many applications we want to build and store indexes of these databases but constructing such indexes is a challenge. Fortunately, many of these genomic databases are highly-repetitive\---|a characteristic that can be exploited to ease the computation of the Burrows-Wheeler Transform (BWT), which underlies many popular indexes. In this paper, we introduce a preprocessing algorithm, referred to as prefix-free parsing, that takes a text T as input, and in one-pass generates a dictionary D and a parse P of T with the property that the BWT of T can be constructed from D and P using workspace proportional to their total size and O(|T|)-time. Our experiments show that D and P are significantly smaller than T in practice, and thus, can fit in a reasonable internal memory even when T is very large. In particular, we show that with prefix-free parsing we can build an 131-megabyte run-length compressed FM-index (restricted to support only counting and not locating) for 1000 copies of human chromosome 19 in 2 hours using 21 gigabytes of memory, suggesting that we can build a 6.73 gigabyte index...Continue Reading

Related Concepts

Chromosomes
Genome, Human
Memory
Mandibular Right Second Primary Molar
Hemoglobin D Disease
Sodium iodide I131
Size
Research Study
Computed (Procedure)
Act Relationship Type - Transformation

Related Feeds

BioRxiv & MedRxiv Preprints

BioRxiv and MedRxiv are the preprint servers for biology and health sciences respectively, operated by Cold Spring Harbor Laboratory. Here are the latest preprint articles (which are not peer-reviewed) from BioRxiv and MedRxiv.

Related Papers

BioRxiv : the Preprint Server for Biology
Alan KuhnleGiovanni Manzini
Philosophical Transactions. Series A, Mathematical, Physical, and Engineering Sciences
H FerradaS J Puglisi
© 2021 Meta ULC. All rights reserved