Obtaining maximal concatenated phylogenetic data sets from large sequence databases

Molecular Biology and Evolution
Michael J Sanderson, Sasha Langley

Abstract

To improve the accuracy of tree reconstruction, phylogeneticists are extracting increasingly large multigene data sets from sequence databases. Determining whether a database contains at least k genes sampled from at least m species is an NP-complete problem. However, the skewed distribution of sequences in these databases permits all such data sets to be obtained in reasonable computing times even for large numbers of sequences. We developed an exact algorithm for obtaining the largest multigene data sets from a collection of sequences. The algorithm was then tested on a set of 100,000 protein sequences of green plants and used to identify the largest multigene ortholog data sets having at least 3 genes and 6 species. The distribution of sizes of these data sets forms a hollow curve, and the largest are surprisingly small, ranging from 62 genes by 6 species, to 3 genes by 65 species, with more symmetrical data sets of around 15 taxa by 15 genes. These upper bounds to sequence concatenation have important implications for building the tree of life from large sequence databases.
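As an illustration of the problem described above, the sequences can be viewed as a gene-by-species presence table, and a concatenated data set as a group of genes that are all sampled for the same group of species. The sketch below is a minimal brute-force enumeration under that assumption, not the authors' algorithm: it collects every intersection of the per-gene species sets and keeps those meeting the thresholds. The function name `maximal_data_sets`, its parameters, and the toy gene and species labels are hypothetical. Its running time can grow exponentially with the number of genes, which is consistent with the NP-completeness noted in the abstract.

```python
def maximal_data_sets(gene_to_species, min_genes, min_species):
    """Enumerate maximal complete gene-by-species blocks (hypothetical helper).

    gene_to_species maps a gene name to the species in which it is sampled.
    Returns (genes, species) pairs in which every gene is sampled for every
    species and no further gene or species can be added to the block.
    """
    rows = {gene: frozenset(sp) for gene, sp in gene_to_species.items()}

    # Collect every species set that is the intersection of one or more rows.
    closed = set()
    for row in rows.values():
        closed |= {row} | {row & other for other in closed}

    results = []
    for species_set in closed:
        if len(species_set) < min_species:
            continue
        # All genes sampled for every species in this set.
        genes = frozenset(g for g, sp in rows.items() if species_set <= sp)
        if len(genes) >= min_genes:
            results.append((genes, species_set))
    return results


# Toy input (made-up labels, for illustration only).
example = {
    "gene1": {"A", "B", "C", "D"},
    "gene2": {"A", "B", "C"},
    "gene3": {"A", "B", "D"},
    "gene4": {"C", "D"},
}
found = maximal_data_sets(example, min_genes=2, min_species=2)
for genes, species in sorted(found, key=lambda p: (len(p[0]), sorted(p[1]))):
    print(sorted(genes), "x", sorted(species))
```

Each species set produced this way is paired with all genes that cover it, so every reported block is maximal: adding another gene would drop a species, and adding another species would drop a gene.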
