Abstract
With the advancement of high-throughput sequencing technologies, the amount of available sequencing data is growing at a pace that now greatly challenges the data processing and storage capacities of modern computer systems. Removing redundancy from such data by clustering could be crucial for reducing memory use, disk space and running time. Clustering can also reduce dataset noise in some analysis applications. In this study, we propose a high-performance short sequence classification algorithm (HSC) for next generation sequencing (NGS) data, based on an efficient hash function and text similarity. First, HSC converts all reads into k-mers, then forms a set of unique k-mers by merging duplicated and reverse-complementary elements. Second, all unique k-mers are stored in a hash table, where the k-mer string is stored in the key field and the IDs of the reads containing that k-mer are stored in the value field. Third, each hash unit is transformed into a short text consisting of reads. Fourth, texts that satisfy the similarity threshold are combined into a long text; the merge operation is executed iteratively until no text satisfies the merge condition. Finally, the lo...
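The first four steps can be illustrated with a minimal Python sketch. This is not the HSC implementation: the function names are hypothetical, read IDs are assumed to be list indices, Jaccard similarity over read-ID sets stands in for the unspecified text similarity measure, and the greedy pairwise merge is only one possible reading of the iterative merge step.

```python
from collections import defaultdict

# Translation table for complementing DNA bases.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse
    complement, so both orientations share one hash key."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def build_kmer_table(reads, k):
    """Map each canonical k-mer (key) to the set of read IDs containing it
    (value). Read IDs are list indices, an assumption of this sketch."""
    table = defaultdict(set)
    for read_id, seq in enumerate(reads):
        for i in range(len(seq) - k + 1):
            table[canonical(seq[i:i + k])].add(read_id)
    return table


def jaccard(a, b):
    """Jaccard similarity of two sets (assumed stand-in for the paper's
    text similarity measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def merge_texts(texts, threshold):
    """Greedily merge read-ID sets whose similarity meets the threshold,
    repeating until no pair qualifies."""
    texts = [set(t) for t in texts]
    merged = True
    while merged:
        merged = False
        for i in range(len(texts)):
            for j in range(i + 1, len(texts)):
                if jaccard(texts[i], texts[j]) >= threshold:
                    # i < j, so index i stays valid after removing j.
                    texts[i] |= texts.pop(j)
                    merged = True
                    break
            if merged:
                break
    return texts
```

For example, calling `build_kmer_table(["ACGTAC", "GTACGT", "TTTTTT"], 4)` and passing the resulting `table.values()` to `merge_texts` with a threshold of 0.5 groups the first two reads together and leaves the third on its own. The quadratic pairwise scan is chosen for readability and would not scale to real NGS data.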