Using 2k + 2 bubble searches to find SNPs in k-mer graphs

BioRxiv : the Preprint Server for Biology
Reda Younsi, Dan MacLean

Abstract

Single Nucleotide Polymorphism (SNP) discovery is an important preliminary for understanding genetic variation. With current sequencing methods we can sample genomes comprehensively. SNPs are found by aligning sequence reads against longer assembled references. De Bruijn graphs are efficient data structures that can deal with the vast amount of data from modern technologies. Recent work has shown that the topology of these graphs captures enough information to allow the detection and characterisation of genetic variants, offering an alternative to alignment-based methods. Such methods rely on depth-first walks of the graph to identify closing bifurcations. These methods are conservative or generate many false-positive results, particularly when traversing highly inter-connected (complex) regions of the graph or in regions of very high coverage. We devised an algorithm that calls SNPs in converted De Bruijn graphs by enumerating 2 k + 2 cycles. We evaluated the accuracy of predicted SNPs by comparison with SNP lists from alignment based methods. We tested accuracy of the SNP calling using sequence data from sixteen ecotypes of Arabidopsis thaliana and found that accuracy was high. We found that SNP calling was even across the ge...Continue Reading

Associated Software

May 20, 2014·Reda Younsi, Dan MacLean

Related Concepts

Cocaine
Genome
Trees (plant)
Sequence Alignment
Arabidopsis thaliana <plant>
Cytochrome-c Oxidase Deficiency
GPER protein, human
Structure
Gene Mutant
Single Nucleotide Polymorphism

Related Feeds

BioRxiv & MedRxiv Preprints

BioRxiv and MedRxiv are the preprint servers for biology and health sciences respectively, operated by Cold Spring Harbor Laboratory. Here are the latest preprint articles (which are not peer-reviewed) from BioRxiv and MedRxiv.