Problems with the nested granularity of feature domains in bioinformatics: the eXtasy case

BMC Bioinformatics
Dusan Popovic, Bart De Moor

Abstract

Data from biomedical domains often have an inherent hierarchical structure. As this structure is usually implicit, its existence can be overlooked by practitioners interested in constructing and evaluating predictive models from such data. Ignoring these constructs leads to a potentially problematic and routinely unrecognized bias in the models and results. In this work, we discuss this bias in detail and propose a simple, sampling-based solution for it. Next, we explore its sources and extent on synthetic data. Finally, we demonstrate how the state-of-the-art variant prioritization framework, eXtasy, benefits from using the described approach in its Random forest-based core classification model. The conducted simulations clearly indicate that the heterogeneous granularity of feature domains poses significant problems for both the standard Random forest classifier and a modification that relies on stratified bootstrapping. Conversely, using the proposed sampling scheme when training the classifier mitigates the described bias. Furthermore, when applied to the eXtasy data under a realistic class distribution scenario, a Random forest learned using the proposed sampling scheme displays much better precision than its standard ver...
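
The abstract does not spell out the proposed sampling scheme, so the following is only a minimal sketch of one way a hierarchy-aware bootstrap could look, not the authors' method. It assumes each row carries a label for the coarse-grained unit it belongs to (for example, the gene a variant falls in). The function name `hierarchical_bootstrap` and its parameters are hypothetical.

```python
import random
from collections import defaultdict


def hierarchical_bootstrap(groups, n_draws=None, seed=None):
    """Draw a two-stage bootstrap sample of row indices.

    Assumption (not taken from the paper): `groups[i]` is the
    coarse-level unit of row i. Stage 1 resamples units with
    replacement; stage 2 picks one row uniformly from each drawn
    unit, so large units do not dominate the sample.
    """
    rng = random.Random(seed)

    # Map each coarse-level unit to the indices of its rows.
    members = defaultdict(list)
    for idx, group in enumerate(groups):
        members[group].append(idx)

    group_ids = list(members)
    if n_draws is None:
        n_draws = len(group_ids)

    sample = []
    for _ in range(n_draws):
        group = rng.choice(group_ids)               # stage 1: pick a unit
        sample.append(rng.choice(members[group]))   # stage 2: pick one row
    return sample
```

In a Random forest setting, each tree could be grown on one such sample in place of the usual row-level bootstrap; whether that matches the scheme evaluated in the paper is not something the abstract confirms.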

Software Mentioned

LRT
PolyPhen
MutationTaster
eXtasy
CAROL
Matlab
Endeavour
SIFT
