An efficient classification algorithm for NGS data based on text similarity

Genetics Research
Xiangyu Liao, Xing Chen

Abstract

With the advancement of high-throughput sequencing technologies, the amount of available sequencing data is growing at a pace that has begun to seriously challenge the data processing and storage capacities of modern computer systems. Removing redundancy from such data by clustering could be crucial for reducing memory use, disk space and running time. It can also reduce dataset noise in some analysis applications. In this study, we propose a high-performance short sequence classification algorithm (HSC) for next generation sequencing (NGS) data based on an efficient hash function and text similarity. First, HSC converts all reads into k-mers, then forms a unique k-mer set by merging duplicated and reverse complementary elements. Second, all unique k-mers are stored in a hash table, where the k-mer string is stored in the key field and the IDs of the reads containing the k-mer are stored in the value field. Third, each hash unit is transformed into a short text consisting of reads. Fourth, texts that satisfy the similarity threshold are combined into a long text; the merge operation is executed iteratively until no text satisfies the merge condition. Finally, the lo...
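
The first four steps described above can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names (`canonical`, `build_kmer_table`, `merge_texts`) are hypothetical, each "text" is modeled simply as a set of read IDs, and Jaccard similarity stands in for the text-similarity measure, which the abstract does not specify. The greedy pairwise merge is likewise an assumption chosen for brevity rather than efficiency.

```python
from collections import defaultdict

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer: str) -> str:
    """Return the lexicographically smaller of a k-mer and its reverse
    complement, so both orientations collapse to a single key."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def build_kmer_table(reads: list[str], k: int) -> dict[str, set[int]]:
    """Steps 1-2: map each unique canonical k-mer (key) to the IDs of the
    reads that contain it (value). Read IDs are list indices here."""
    table: dict[str, set[int]] = defaultdict(set)
    for read_id, read in enumerate(reads):
        for i in range(len(read) - k + 1):
            table[canonical(read[i:i + k])].add(read_id)
    return dict(table)


def jaccard(a: set[int], b: set[int]) -> float:
    """Assumed similarity measure: shared read IDs over all read IDs."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def merge_texts(texts: list[set[int]], threshold: float) -> list[set[int]]:
    """Step 4: merge any pair of texts that meets the threshold, and repeat
    until no pair qualifies (a simple quadratic scan, for illustration)."""
    texts = [set(t) for t in texts]  # copy so the caller's sets are untouched
    merged = True
    while merged:
        merged = False
        for i in range(len(texts)):
            for j in range(i + 1, len(texts)):
                if jaccard(texts[i], texts[j]) >= threshold:
                    other = texts.pop(j)
                    texts[i] |= other
                    merged = True
                    break
            if merged:
                break
    return texts


# Toy input: reads 0 and 1 overlap; reads 2 and 3 are reverse complements.
reads = ["ACGTAC", "CGTACG", "TTTTGG", "CCAAAA"]
table = build_kmer_table(reads, k=4)
# Step 3: each hash unit becomes a short "text" (here, its set of read IDs).
clusters = merge_texts(list(table.values()), threshold=0.5)
print(clusters)
```

In this toy input, the canonical form is what lets reads 2 and 3 share k-mer keys even though they come from opposite strands. The threshold of 0.5 and k of 4 are arbitrary illustration values, not parameters taken from the paper.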



