Extensive error in the number of genes inferred from draft genome assemblies

PLoS Computational Biology
James F Denton, Matthew W Hahn

Abstract

Current sequencing methods produce large amounts of data, but genome assemblies based on these data are often woefully incomplete. These incomplete and error-filled assemblies result in many annotation errors, especially in the number of genes present in a genome. In this paper we investigate the magnitude of the problem, both in terms of total gene number and the number of copies of genes in specific families. To do this, we compare multiple draft assemblies against higher-quality versions of the same genomes, using several new assemblies of the chicken genome based on both traditional and next-generation sequencing technologies, as well as published draft assemblies of chimpanzee. We find that upwards of 40% of all gene families are inferred to have the wrong number of genes in draft assemblies, and that these incorrect assemblies both add and subtract genes. Using simulated genome assemblies of Drosophila melanogaster, we find that the major cause of increased gene numbers in draft genomes is the fragmentation of genes onto multiple individual contigs. Finally, we demonstrate the usefulness of RNA-Seq in improving the gene annotation of draft assemblies, largely by connecting genes that have been fragmented in the assembly p...
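To make the fragmentation effect concrete, the toy sketch below shows how a gene split across two contigs can inflate a family's copy number, and how evidence linking the fragments can restore the count. It is an illustration only and not the authors' pipeline: the gene models, family labels, and the `links` list (standing in for, say, RNA-Seq evidence that spans two fragments) are all hypothetical, and fragments of one gene are assumed to be assigned to the same family.

```python
from collections import Counter


def family_counts(models):
    """Count annotated gene models per gene family.

    `models` is a list of (model_id, family) pairs (hypothetical format).
    """
    return Counter(family for _model_id, family in models)


def merge_fragments(models, links):
    """Merge gene models that linking evidence says are one gene.

    `links` is a list of (model_id, model_id) pairs; linked models are
    grouped with a simple union-find and each group is counted once.
    """
    parent = {model_id: model_id for model_id, _family in models}

    def find(x):
        # Follow parent pointers to the group representative,
        # shortening the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in links:
        parent[find(a)] = find(b)

    family_of = dict(models)
    merged = {}
    for model_id, _family in models:
        root = find(model_id)
        merged[root] = family_of[root]  # one entry per merged group
    return list(merged.items())


# Hypothetical reference annotation: family A has 2 genes, family B has 1.
reference = [("a1", "A"), ("a2", "A"), ("b1", "B")]

# Hypothetical draft annotation: gene a1 is split across two contigs,
# so it is annotated as two separate gene models.
draft = [("a1_frag1", "A"), ("a1_frag2", "A"), ("a2", "A"), ("b1", "B")]

# Hypothetical evidence connecting the two fragments of a1.
links = [("a1_frag1", "a1_frag2")]

print("reference:", dict(family_counts(reference)))
print("draft:    ", dict(family_counts(draft)))
print("merged:   ", dict(family_counts(merge_fragments(draft, links))))
```

In this toy case the draft overcounts family A by one gene, and merging the linked fragments brings the count back to the reference value. Real annotations would also need to handle genes missing from the assembly, which this sketch does not model.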
