Effect of lossy compression of quality scores on variant calling

BioRxiv: the Preprint Server for Biology
Idoia Ochoa, Euan Ashley

Abstract

Recent advancements in sequencing technology have led to a drastic reduction in the cost of genome sequencing. This development has generated an unprecedented amount of genomic data that must be stored, processed, and communicated. To facilitate this effort, compression of genomic files has been proposed. Specifically, lossy compression of quality scores is emerging as a natural candidate for reducing the growing costs of storage. A main goal of performing DNA sequencing in population studies and clinical settings is to identify genetic variation. Though the field agrees that smaller files are advantageous, the cost of lossy compression, in terms of variant discovery, is unclear. Bioinformatic algorithms to identify SNPs and INDELs from next-generation DNA sequencing data use base quality score information; here, we evaluate the effect of lossy compression of quality scores on SNP and INDEL detection. We analyze several lossy compressors introduced recently in the literature. Specifically, we investigate how the output of the variant caller when using the original data (uncompressed) differs from that obtained when quality scores are replaced by those generated by a lossy compressor. Using gold standard genomic datasets such as...
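
The comparison described in the abstract can be illustrated with a minimal sketch. It is not one of the compressors evaluated in the paper: the uniform binning below is only an assumed stand-in for a lossy quality-score compressor, and the names `bin_quality`, `lossy_quality_string`, and `compare_call_sets` are hypothetical. The sketch assumes Phred+33 encoded quality strings and represents each variant as a (chromosome, position, ref, alt) tuple.

```python
from typing import Dict, Iterable, Tuple

# A variant is identified here by (chromosome, position, ref allele, alt allele).
Variant = Tuple[str, int, str, str]


def bin_quality(q: int, bin_width: int = 10) -> int:
    """Replace a Phred score with the midpoint of its bin (uniform quantizer).

    Assumption: this simple binning stands in for a real lossy compressor.
    """
    if q < 0:
        raise ValueError("Phred scores must be non-negative")
    low = (q // bin_width) * bin_width
    return low + bin_width // 2


def lossy_quality_string(qual: str, bin_width: int = 10, offset: int = 33) -> str:
    """Apply bin_quality to every character of a Phred+33 quality string."""
    return "".join(
        chr(bin_quality(ord(c) - offset, bin_width) + offset) for c in qual
    )


def compare_call_sets(
    calls: Iterable[Variant], truth: Iterable[Variant]
) -> Dict[str, float]:
    """Score a variant call set against a gold-standard set."""
    call_set, truth_set = set(calls), set(truth)
    tp = len(call_set & truth_set)  # called and in the gold standard
    fp = len(call_set - truth_set)  # called but not in the gold standard
    fn = len(truth_set - call_set)  # in the gold standard but missed
    return {
        "tp": tp,
        "fp": fp,
        "fn": fn,
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
    }
```

In such a setup, the variant caller would be run once on reads with the original quality strings and once on reads whose quality strings were passed through `lossy_quality_string`; scoring both call sets with `compare_call_sets` against the same gold-standard set gives a side-by-side view of sensitivity and precision.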

Related Concepts

Lossy Compression
Genome
Nucleic Acid Sequencing
Bio-Informatics
Human DNA Sequencing
Evaluation
URL Data Type
Genomics
Sequencing
Massively-Parallel Sequencing

Related Papers

Bioinformatics
Greg Malysa, Tsachy Weissman
BioRxiv: the Preprint Server for Biology
Łukasz Roguski, Sebastian Deorowicz