On genomic repeats and reproducibility

Bioinformatics
Can Firtina, Can Alkan

Abstract

Here, we present a comprehensive analysis of the reproducibility of computational characterization of genomic variants using high-throughput sequencing data. We reanalyzed the same datasets twice, using the same tools with the same parameters, altering only the order of reads in the input (i.e. the FASTQ file). Reshuffling caused reads from repetitive regions to be mapped to different locations in the second alignment, and we observed similar results when we applied only a scatter/gather approach for read mapping, without prior shuffling. Our results show that some of the most common variation discovery algorithms do not handle ambiguous read mappings accurately when random locations are selected. In addition, we observed that even when the exact same alignment is used, the GATK HaplotypeCaller generates slightly different call sets, which we pinpoint to the variant filtration step. We conclude that algorithms at each step of genomic variation discovery and characterization need to treat ambiguous mappings in a deterministic fashion to ensure full replication of results. Code, scripts and the generated VCF files are available at DOI:10.5281/zenodo.32611. Contact: calkan@cs.bilkent.edu.tr. Supplementary data are availabl...
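The contrast between random and deterministic handling of ambiguous mappings can be illustrated with a small Python sketch. This is not the authors' code or the behavior of any particular aligner; the function names (`pick_random`, `pick_deterministic`, `map_reads`) and the toy candidate lists are hypothetical. It assumes each read arrives with a list of equally good candidate locations, and compares a choice drawn from a shared random number generator (which depends on read order) with a choice derived from a hash of the read name (which does not).

```python
import hashlib
import random


def pick_random(candidates, rng):
    """Order-dependent choice: the outcome depends on how many
    draws the shared RNG has already served for earlier reads."""
    return rng.choice(candidates)


def pick_deterministic(read_name, candidates):
    """Order-independent choice: the index is derived from a hash
    of the read name, so input order and chunking do not matter."""
    digest = hashlib.sha256(read_name.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(candidates)
    # Sort so the order in which candidates were found is irrelevant.
    return sorted(candidates)[index]


def map_reads(reads, deterministic):
    """Assign one location per read from its equally good candidates.

    `reads` is a list of (read_name, [(chrom, pos), ...]) pairs,
    a toy stand-in for the output of a real read mapper.
    """
    rng = random.Random(42)  # fixed seed, shared across all reads
    placements = {}
    for name, candidates in reads:
        if deterministic:
            placements[name] = pick_deterministic(name, candidates)
        else:
            placements[name] = pick_random(candidates, rng)
    return placements


if __name__ == "__main__":
    # Toy reads, each with three equally good locations in repeats.
    reads = [
        (f"read{i}", [("chr1", 100 + i), ("chr5", 2000 + i), ("chr9", 30 + i)])
        for i in range(50)
    ]
    shuffled = reads[:]
    random.Random(7).shuffle(shuffled)

    same_hash = map_reads(reads, True) == map_reads(shuffled, True)
    same_rng = map_reads(reads, False) == map_reads(shuffled, False)
    print("hash-based placements identical after shuffling:", same_hash)
    print("RNG-based placements identical after shuffling:", same_rng)
```

Even with a fixed seed, the RNG-based choice ties each read's placement to its position in the input, which is the kind of order sensitivity the abstract describes. Keying the choice on the read itself removes that dependence under the assumptions of this sketch.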
