Oct 24, 2018

MeShClust2: Application of alignment-free identity scores in clustering long DNA sequences

BioRxiv : the Preprint Server for Biology
Benjamin T. James, Hani Z. Girgis

Abstract

Grouping similar sequences into clusters is an important part of sequence analysis. Widely used clustering tools sacrifice quality for speed. Previously, we developed MeShClust, which utilizes k-mer counts in an alignment-assisted classifier and the mean-shift algorithm for clustering DNA sequences. Although MeShClust outperformed related tools in terms of cluster quality, the alignment algorithm used for generating training data for the classifier was not scalable to longer sequences. In contrast, MeShClust2 generates semi-synthetic sequence pairs with known mutation rates, avoiding alignment algorithms. MeShClust2 clustered 3600 bacterial genomes, providing, for the first time, a utility for clustering long sequences using identity scores.
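
To illustrate the idea of semi-synthetic pairs with known mutation rates, here is a minimal Python sketch. It is not the MeShClust2 implementation: the function names (`kmer_counts`, `mutate`, `identity`, `kmer_distance`) are hypothetical, and it assumes substitution-only mutations and a simple Manhattan distance on k-mer counts. The point it shows is that when a copy is mutated at a chosen rate, the identity of the pair is known by construction, so alignment-free k-mer features can be labeled without running an alignment.

```python
import random
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping k-mer in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def mutate(seq, rate, rng):
    """Return a copy of seq with round(rate * len(seq)) substitutions.

    Assumption: substitutions only, each at a distinct position and
    always to a different base, so the identity of the pair is known.
    """
    n_mut = round(rate * len(seq))
    out = list(seq)
    for pos in rng.sample(range(len(seq)), n_mut):
        out[pos] = rng.choice([b for b in "ACGT" if b != out[pos]])
    return "".join(out)


def identity(a, b):
    """Fraction of matching positions between two equal-length sequences."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def kmer_distance(a, b, k):
    """Manhattan distance between k-mer count vectors (alignment-free)."""
    ca, cb = kmer_counts(a, k), kmer_counts(b, k)
    return sum(abs(ca[m] - cb[m]) for m in set(ca) | set(cb))


rng = random.Random(0)
template = "".join(rng.choices("ACGT", k=1000))  # stand-in for a real sequence
mutated = mutate(template, 0.10, rng)            # known mutation rate: 10%

# The label (identity) comes from the mutation rate, not from an alignment.
label = identity(template, mutated)
feature = kmer_distance(template, mutated, k=4)
```

In this sketch, each (`feature`, `label`) pair could serve as one training example for a classifier or regressor that predicts identity from k-mer statistics; how MeShClust2 actually builds and uses its training data is not specified in the abstract.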


Mentioned in this Paper

Classification
Genome
Sequence Analysis
Analysis
Gene Clusters
Genome, Bacterial
Mutation Abnormality
DNA
DNA Sequence
