DOI: 10.1101/451278Oct 24, 2018Paper

MeShClust2: Application of alignment-free identity scores in clustering long DNA sequences

BioRxiv : the Preprint Server for Biology
Benjamin T James, Hani Z Girgis


Grouping sequences into similar clusters is an important part of sequence analysis. Widely used clustering tools sacrifice quality for speed. Previously, we developed MeShClust, which utilizes k-mer counts in an alignment-assisted classifier and the mean-shift algorithm for clustering DNA sequences. Although MeShClust outperformed related tools in terms of cluster quality, the alignment algorithm used for generating training data for the classifier was not scalable to longer sequences. In contrast, MeShClust2 generates semi-synthetic sequence pairs with known mutation rates, avoiding alignment algorithms. MeShClust2 clustered 3600 bacterial genomes, providing a utility for clustering long sequences using identity scores for the first time.

Related Concepts

Related Feeds

BioRxiv & MedRxiv Preprints

BioRxiv and MedRxiv are the preprint servers for biology and health sciences respectively, operated by Cold Spring Harbor Laboratory. Here are the latest preprint articles (which are not peer-reviewed) from BioRxiv and MedRxiv.