Abstract
Data from biomedical domains often have an inherent hierarchical structure. As this structure is usually implicit, its existence can be overlooked by practitioners interested in constructing and evaluating predictive models from such data. Ignoring this structure leads to a potentially problematic and routinely unrecognized bias in the models and results. In this work, we discuss this bias in detail and propose a simple, sampling-based solution for it. Next, we explore its sources and extent on synthetic data. Finally, we demonstrate how the state-of-the-art variant prioritization framework, eXtasy, benefits from using the described approach in its Random forest-based core classification model. The simulations indicate that the heterogeneous granularity of feature domains poses significant problems for both the standard Random forest classifier and a modification that relies on stratified bootstrapping. Conversely, using the proposed sampling scheme when training the classifier mitigates the described bias. Furthermore, when applied to the eXtasy data under a realistic class distribution scenario, a Random forest learned using the proposed sampling scheme displays much better precision than its standard ver...
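The abstract does not spell out the sampling scheme, so the following is only an illustrative sketch of one way a hierarchy-aware bootstrap could look, not the authors' method. It assumes each row carries a hypothetical `group_ids` label naming the higher-level entity it belongs to; the function name `hierarchical_bootstrap` and the toy data are likewise invented for illustration. Instead of drawing rows uniformly, it draws a group first and then one row within that group, so a group with many rows does not dominate the sample.

```python
import random
from collections import defaultdict


def hierarchical_bootstrap(group_ids, n_samples, seed=None):
    """Draw row indices by sampling a group first, then one row within it.

    Assumption: `group_ids[i]` names the higher-level entity of row i.
    """
    rng = random.Random(seed)

    # Index the rows belonging to each group.
    rows_by_group = defaultdict(list)
    for row, gid in enumerate(group_ids):
        rows_by_group[gid].append(row)

    groups = sorted(rows_by_group)  # fixed order keeps seeded runs reproducible

    sample = []
    for _ in range(n_samples):
        gid = rng.choice(groups)                       # every group equally likely
        sample.append(rng.choice(rows_by_group[gid]))  # then a row inside it
    return sample


# Toy data: one group with 90 rows and ten groups with a single row each.
group_ids = ["big"] * 90 + [f"small{i}" for i in range(10)]
sample = hierarchical_bootstrap(group_ids, n_samples=1000, seed=0)

# Share of sampled rows that come from the large group.
big_share = sum(group_ids[i] == "big" for i in sample) / len(sample)
```

With a plain row-level bootstrap the large group would supply roughly 90% of the sampled rows; under this sketch each of the eleven groups is drawn with equal probability, so the large group's expected share is 1/11. In a Random forest setting, such indices could replace the per-tree bootstrap sample, though how the published scheme handles this is not stated in the abstract.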