A Sequence-Based Novel Approach for Quality Evaluation of Third-Generation Sequencing Reads

Genes
Wenjing Zhang, Hong-Dong Li

Abstract

The advent of third-generation sequencing (TGS) technologies, such as the Pacific Biosciences (PacBio) and Oxford Nanopore machines, provides new possibilities for contig assembly, scaffolding, and high-performance computing in bioinformatics due to their long reads. However, the high error rate and poor quality of TGS reads pose new challenges for accurate genome assembly and long-read alignment. Efficient processing methods are needed to prioritize high-quality reads and thereby improve the results of error correction and assembly. In this study, we proposed a novel Read Quality Evaluation and Selection Tool (REQUEST) for evaluating the quality of third-generation long reads. REQUEST generates training data of high-quality and low-quality reads, which are characterized by their nucleotide combinations. A linear regression model was built to score the quality of reads. The method was tested on three datasets from different species. The results showed that the top-scored reads prioritized by REQUEST achieved higher alignment accuracies. The contig assembly results based on the top-scored reads also outperformed conventional approaches that use all reads. REQUEST is able to distinguish high-quality reads from low-quality ones without u...
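The scoring idea in the abstract (reads described by their nucleotide combinations, then scored by a linear regression model) can be illustrated with a small sketch. This is not the REQUEST implementation. The k-mer size (k = 2), the 1/0 labels for high-quality and low-quality reads, the gradient-descent fit, the function names, and the toy reads are all assumptions chosen for illustration only.

```python
from itertools import product


def kmer_features(read, k=2):
    """Return normalized frequencies of all 4**k k-mers over ACGT in `read`."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = 0
    for i in range(len(read) - k + 1):
        kmer = read[i:i + k]
        if kmer in counts:  # skip k-mers containing N or other symbols
            counts[kmer] += 1
            total += 1
    return [counts[m] / total if total else 0.0 for m in kmers]


def fit_linear(X, y, lr=0.5, epochs=2000):
    """Fit y ~ w.x + b by batch gradient descent on mean squared error."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sum(wj * xj for wj, xj in zip(w, xi)) + b - yi
            grad_b += err
            for j in range(d):
                grad_w[j] += err * xi[j]
        b -= lr * grad_b / n
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
    return w, b


def score(read, w, b, k=2):
    """Score a read with the fitted linear model; higher means 'more like label 1'."""
    return sum(wj * xj for wj, xj in zip(w, kmer_features(read, k))) + b


# Hypothetical toy training set: label 1 = high quality, 0 = low quality.
train = [
    ("ACGTACGTACGTACGT", 1.0),
    ("ACGTTGCAACGTTGCA", 1.0),
    ("AAAAAAAACCCCCCCC", 0.0),
    ("GGGGGGGGTTTTTTTT", 0.0),
]
X = [kmer_features(r) for r, _ in train]
y = [label for _, label in train]
w, b = fit_linear(X, y)

# Rank unseen (also hypothetical) reads from highest to lowest score.
candidates = ["AAAACCCCGGGGTTTT", "ACGTACGTTGCAACGT"]
ranked = sorted(candidates, key=lambda r: score(r, w, b), reverse=True)
print(ranked)
```

In a real pipeline the top-ranked fraction of reads would then be passed on to error correction or assembly. How REQUEST labels its training reads and which nucleotide combinations it uses are not stated in the text above.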

Datasets Mentioned

SRR7167956

Related Concepts

Metazoa
Drosophila melanogaster
Alkalescens-Dispar Group
Computer Programs and Programming
Yersinia pestis
Sequence Determinations, DNA
High-Throughput Nucleotide Sequencing
Drug Combinations
Genome
Software Tools
