DOI: 10.1101/501130 | Dec 19, 2018 | Paper

Nucleotide Archival Format (NAF) enables efficient lossless reference-free compression of DNA sequences

BioRxiv: the Preprint Server for Biology
Kirill Kryukov, Tadashi Imanishi

Abstract

Summary: DNA sequence databases use compression such as gzip to reduce the required storage space and network transmission time. We describe Nucleotide Archival Format (NAF), a new file format for lossless reference-free compression of FASTA- and FASTQ-formatted nucleotide sequences. NAF achieves a compression ratio comparable to that of the best DNA compressors while providing dramatically faster decompression. We compared our format with the DNA compressors DELIMINATE and MFCompress, and with the general-purpose compressors gzip, bzip2, xz, brotli, and zstd.

Availability and implementation: The NAF compressor and decompressor, as well as the format specification, are available at https://github.com/KirillKryukov/naf. The format specification is in the public domain. The compressor and decompressor are open source under the zlib/libpng license, free for nearly any use.

Contact: kkryukov@gmail.com
