Scaling statistical multiple sequence alignment to large datasets

BMC Genomics
Michael Nute, Tandy Warnow

Abstract

Multiple sequence alignment is an important task in bioinformatics, and alignments of large datasets containing hundreds or thousands of sequences are increasingly of interest. While many alignment methods exist, the most accurate alignments are likely to be based on stochastic models where sequences evolve down a tree with substitutions, insertions, and deletions. Some methods have been developed to estimate alignments under these stochastic models, but only the Bayesian method BAli-Phy has been able to run on even moderately large datasets, containing 100 or so sequences. A technique to extend BAli-Phy to enable alignments of thousands of sequences could potentially improve alignment and phylogenetic tree accuracy on large-scale data beyond the best-known methods today. We use simulated data with up to 10,000 sequences representing a variety of model conditions, including some that are significantly divergent from the statistical models used in BAli-Phy and elsewhere. We give a method for incorporating BAli-Phy into PASTA and UPP, two strategies for enabling alignment methods to scale to large datasets, and give alignment and tree accuracy results measured against the ground truth from simulations. Comparable results are al...
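The abstract does not say which accuracy measures were used. One common way to score an estimated alignment against a true alignment from simulation is to compare the residue pairs each one places in the same column. The sketch below is a minimal illustration under that assumption; the function names and the dictionary-of-gapped-rows input format are hypothetical and are not taken from the paper, BAli-Phy, PASTA, or UPP.

```python
from itertools import combinations


def homologous_pairs(alignment):
    """Return the set of residue pairs that share a column.

    `alignment` maps sequence name -> gapped row ('-' marks a gap).
    A residue is identified as (sequence name, index in the ungapped sequence).
    """
    names = sorted(alignment)
    lengths = {len(alignment[name]) for name in names}
    if len(lengths) > 1:
        raise ValueError("all rows must have the same length")
    n_cols = lengths.pop() if lengths else 0

    next_index = {name: 0 for name in names}  # next ungapped position per row
    pairs = set()
    for col in range(n_cols):
        residues = []
        for name in names:
            if alignment[name][col] != "-":
                residues.append((name, next_index[name]))
                next_index[name] += 1
        # Every two residues in the same column form one homologous pair.
        pairs.update(combinations(residues, 2))
    return pairs


def pairwise_error_rates(true_aln, est_aln):
    """Return (fraction of true pairs missed, fraction of estimated pairs not true)."""
    true_pairs = homologous_pairs(true_aln)
    est_pairs = homologous_pairs(est_aln)
    shared = true_pairs & est_pairs
    missed = 1 - len(shared) / len(true_pairs) if true_pairs else 0.0
    spurious = 1 - len(shared) / len(est_pairs) if est_pairs else 0.0
    return missed, spurious
```

Both alignments must contain the same ungapped sequences for the comparison to be meaningful. This version stores every pair explicitly, so memory grows quadratically with the number of sequences per column; a count-based approach would be needed at the dataset sizes the abstract mentions.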
