These are not the k-mers you are looking for: efficient online k-mer counting using a probabilistic data structure

PLoS ONE
Qingpeng Zhang, ..., C Titus Brown

Abstract

K-mer abundance analysis is widely used in nucleotide sequence analysis, including data preprocessing for de novo assembly, repeat detection, and sequencing coverage estimation. We present the khmer software package for fast and memory-efficient online counting of k-mers in sequencing data sets. Unlike previous methods based on data structures such as hash tables, suffix arrays, and trie structures, khmer relies entirely on a simple probabilistic data structure, a Count-Min Sketch. The Count-Min Sketch permits online updating and retrieval of k-mer counts in memory, which is necessary to support online k-mer analysis algorithms. On sparse data sets this data structure is considerably more memory efficient than any exact data structure. In exchange, the use of a Count-Min Sketch introduces a systematic overcount for k-mers; moreover, only the counts, and not the k-mers, are stored. Here we analyze the speed, the memory usage, and the miscount rate of khmer for generating k-mer frequency distributions and retrieving k-mer counts for individual k-mers. We also compare the performance of khmer to several other k-mer counting packages, including Tallymer, Jellyfish, BFCounter, DSK, KMC, Turtle, and KAnalyze. Finally, ...
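
To make the trade-off described in the abstract concrete, the following is a minimal Python sketch of k-mer counting with a Count-Min Sketch. It is an illustration only and not khmer's actual implementation: the class name `CountMinSketch`, the helper `kmers`, the `width` and `depth` parameters, and the use of salted BLAKE2b hashing are all assumptions chosen for clarity. Each k-mer increments one counter per row, and a query returns the minimum across rows, so a reported count can exceed the true count (when k-mers collide) but can never fall below it. Only the counters are kept; the k-mers themselves are not stored.

```python
import hashlib


class CountMinSketch:
    """Illustrative Count-Min Sketch (hypothetical; not khmer's code)."""

    def __init__(self, width=1024, depth=4):
        self.width = width    # counters per row (assumed parameter)
        self.depth = depth    # number of independent rows (assumed parameter)
        self.tables = [[0] * width for _ in range(depth)]

    def _positions(self, item):
        # One salted hash per row stands in for independent hash functions.
        for row in range(self.depth):
            digest = hashlib.blake2b(
                item.encode("ascii"),
                digest_size=8,
                salt=row.to_bytes(8, "little"),
            ).digest()
            yield row, int.from_bytes(digest, "little") % self.width

    def add(self, item):
        # Online update: increment one counter in every row.
        for row, col in self._positions(item):
            self.tables[row][col] += 1

    def count(self, item):
        # The minimum over rows is the estimate; collisions only inflate it.
        return min(self.tables[row][col] for row, col in self._positions(item))


def kmers(sequence, k):
    """Yield every overlapping substring of length k."""
    for start in range(len(sequence) - k + 1):
        yield sequence[start:start + k]


if __name__ == "__main__":
    sketch = CountMinSketch()
    for kmer in kmers("ACGTACGTAC", 4):
        sketch.add(kmer)
    # "ACGT" occurs twice in the input, so the estimate is at least 2.
    print(sketch.count("ACGT"))
```

Memory use in this sketch is fixed by `width * depth` regardless of how many distinct k-mers are seen; smaller tables save memory at the cost of more collisions and therefore larger overcounts.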
