Apr 29, 2020

Large-scale Inference of Population Structure in Presence of Missingness using PCA

BioRxiv : the Preprint Server for Biology
Jonas MeisnerM. Huang

Abstract

Principal component analysis (PCA) is a commonly used tool in genetics to capture and visualize population structure. Due to technological advances in sequencing, such as the widely used non-invasive prenatal test, massive datasets of ultra-low coverage sequencing are being generated. These datasets are characterized by having a large amount of missing genotype information. We present EMU, a method for inferring population structure in the presence of rampant non-random missingness. We show through simulations that several commonly used PCA methods can not handle missing data arisen from various sources, which leads to biased results as individuals are projected into the PC space based on their amount of missingness. In terms of accuracy, EMU outperforms an existing method that also accommodates missingness while being competitively fast. We further tested EMU on around 100K individuals of the Phase 1 dataset of the Chinese Millionome Project, that were shallowly sequenced to around 0.08x. From this data we are able to capture the population structure of the Han Chinese and to reproduce previous analysis in a matter of CPU hours instead of CPU years. EMU's capability to accurately infer population structure in the presence of m...Continue Reading

  • References
  • Citations

References

  • We're still populating references for this paper, please check back later.
  • References
  • Citations

Citations

  • This paper may not have been cited yet.

Mentioned in this Paper

HRAS
Molecular Dynamics
Study
Regulation of Biological Process
Mutant Proteins
Etiology
Pharmacologic Substance
Nucleotides
Mutant
Simulation

Related Feeds

BioRxiv & MedRxiv Preprints

BioRxiv and MedRxiv are the preprint servers for biology and health sciences respectively, operated by Cold Spring Harbor Laboratory. Here are the latest preprint articles (which are not peer-reviewed) from BioRxiv and MedRxiv.