Mash: fast genome and metagenome distance estimation using MinHash
Abstract
Mash extends the MinHash dimensionality-reduction technique to include a pairwise mutation distance and P value significance test, enabling the efficient clustering and search of massive sequence collections. Mash reduces large sequences and sequence sets to small, representative sketches, from which global mutation distances can be rapidly estimated. We demonstrate several use cases, including the clustering of all 54,118 NCBI RefSeq genomes in 33 CPU h; real-time database search using assembled or unassembled Illumina, Pacific Biosciences, and Oxford Nanopore data; and the scalable clustering of hundreds of metagenomic samples by composition. Mash is freely released under a BSD license (https://github.com/marbl/mash).
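To make the idea of sketching concrete, here is a minimal Python illustration of how two sequences might be reduced to bottom-s MinHash sketches and compared. It is a sketch under stated assumptions, not the Mash implementation: the function names (`sketch`, `jaccard_estimate`, `mash_distance`) are hypothetical, BLAKE2b stands in for the hash function Mash actually uses, reverse-complement (canonical) k-mers are ignored, and the P value test is omitted. The distance formula, D = -(1/k) ln(2j / (1 + j)), is taken from the body of the Mash paper rather than from the abstract above.

```python
import hashlib
import math


def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def hash_kmer(kmer):
    """Map a k-mer to a 64-bit integer (BLAKE2b is a stand-in hash)."""
    digest = hashlib.blake2b(kmer.encode("ascii"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def sketch(seq, k=21, s=1000):
    """Bottom-s sketch: the s smallest distinct k-mer hashes, sorted."""
    return sorted({hash_kmer(km) for km in kmers(seq, k)})[:s]


def jaccard_estimate(sketch_a, sketch_b, s=1000):
    """Estimate Jaccard similarity from two bottom-s sketches.

    Take the s smallest hashes of the merged sketches and count how
    many of them occur in both inputs.
    """
    merged = sorted(set(sketch_a) | set(sketch_b))[:s]
    if not merged:
        return 0.0
    shared = set(sketch_a) & set(sketch_b)
    return sum(1 for h in merged if h in shared) / len(merged)


def mash_distance(j, k=21):
    """Convert a Jaccard estimate j into a mutation distance."""
    if j <= 0.0:
        return 1.0  # no shared k-mers: report the maximum distance
    return -math.log(2.0 * j / (1.0 + j)) / k


if __name__ == "__main__":
    a = "ACGTTGCATGCCGATAGCTAGCTAGGATCCGATCGATTACGGCTAAGCTTAGC" * 4
    b = a[:100] + "T" + a[101:]  # one edited position
    j = jaccard_estimate(sketch(a), sketch(b))
    print("Jaccard estimate:", j)
    print("Mash distance:", mash_distance(j))
```

Because only the s smallest hashes are kept, the cost of a comparison depends on the sketch size rather than on genome length, which is what makes all-pairs clustering of large collections practical.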
Citations
ClermonTyping: an easy-to-use and accurate in silico method for Escherichia genus strain phylotyping
High contiguity genome sequence of a multidrug-resistant hospital isolate of Enterobacter hormaechei
Genomic diversity affects the accuracy of bacterial single-nucleotide polymorphism-calling pipelines
SNP-IT Tool for Identifying Subspecies and Associated Lineages of Mycobacterium tuberculosis Complex
Characterizing transcriptional regulatory sequences in coronaviruses and their role in recombination