Modeling aspects of the language of life through transfer-learning protein sequences

BMC Bioinformatics
Michael Heinzinger, Burkhard Rost

Abstract

Predicting protein function and structure from sequence is an important challenge for computational biology. For 26 years, most state-of-the-art approaches have combined machine learning with evolutionary information. However, for some applications, retrieving related proteins is becoming too time-consuming. Additionally, evolutionary information is less powerful for small families, e.g. for proteins from the Dark Proteome. Both of these problems are addressed by the new methodology introduced here. We introduced a novel way to represent protein sequences as continuous vectors (embeddings) by using the language model ELMo taken from natural language processing. By modeling protein sequences, ELMo effectively captured the biophysical properties of the language of life from unlabeled big data (UniRef50). We refer to these new embeddings as SeqVec (Sequence-to-Vector) and demonstrate their effectiveness by training simple neural networks for two different tasks. At the per-residue level, secondary structure (Q3 = 79% ± 1, Q8 = 68% ± 1) and regions with intrinsic disorder (MCC = 0.59 ± 0.03) were predicted significantly better than through one-hot encoding or through Word2vec-like approaches. At the per-protein level, subcellular localizati...
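
The per-residue figures quoted in the abstract use two standard metrics: Q3, the fraction of residues assigned the correct one of three secondary-structure states, and the Matthews correlation coefficient (MCC) for the binary disorder prediction. The following is a minimal sketch of how such scores can be computed. It is not the authors' evaluation code, and the function names and toy inputs are hypothetical, chosen only for illustration.

```python
import math


def q3_accuracy(true_states, predicted_states):
    """Fraction of residues whose 3-state label (e.g. H, E, C) is correct."""
    if len(true_states) != len(predicted_states):
        raise ValueError("label sequences must have equal length")
    if len(true_states) == 0:
        raise ValueError("label sequences must be non-empty")
    correct = sum(t == p for t, p in zip(true_states, predicted_states))
    return correct / len(true_states)


def matthews_corrcoef(true_labels, predicted_labels):
    """MCC for binary per-residue labels (1 = disordered, 0 = ordered)."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label sequences must have equal length")
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0 when any marginal count is zero.
    return 0.0 if denominator == 0 else (tp * tn - fp * fn) / denominator


# Toy example (made-up labels, not data from the paper).
toy_q3 = q3_accuracy("HHHHEEEECC", "HHHCEEECCC")
toy_mcc = matthews_corrcoef([1, 1, 1, 0, 0, 0, 0, 0], [1, 1, 0, 0, 0, 0, 0, 1])
print(f"Q3 = {toy_q3:.2f}, MCC = {toy_mcc:.3f}")
```

Q3 is reported as a percentage in the abstract, so the fraction returned here would be multiplied by 100. Q8 is computed in the same way over eight states instead of three.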
