Efficient coalescent simulation and genealogical analysis for large sample sizes

BioRxiv : the Preprint Server for Biology
Jerome Kelleher, Gil McVean

Abstract

A central challenge in the analysis of genetic variation is to provide realistic genome simulation across millions of samples. Present-day coalescent simulations do not scale well, or use approximations that fail to capture important long-range linkage properties. Analysing the results of simulations also presents a substantial challenge, as current methods to store genealogies consume a great deal of space, are slow to parse, and do not take advantage of shared structure in correlated trees. We solve these problems by introducing sparse trees and coalescence records as the key units of genealogical analysis. Using these tools, exact simulation of the coalescent with recombination for chromosome-sized regions over hundreds of thousands of samples is possible, and substantially faster than present-day approximate methods. We can also analyse the results orders of magnitude more quickly than with existing methods.
