K-mer clustering algorithm using a MapReduce framework: application to the parallelization of the Inchworm module of Trinity

BMC Bioinformatics
Chang Sik Kim, Kirk E Jordan

Abstract

De novo transcriptome assembly is an important technique for understanding gene expression in non-model organisms. Many de novo assemblers build the de Bruijn graph of a set of RNA sequences and rely on an in-memory representation of this graph. However, current methods analyse the complete set of read-derived k-mer sequences at once, which requires computer hardware with large shared memory. We introduce a novel approach that clusters k-mers as the first step. The clusters correspond to small sets of gene products, which can be processed quickly to give candidate transcripts. We implement the clustering step using the MapReduce approach for parallelising the analysis of large datasets, which enables the use of compute clusters. The computational task is distributed across the compute system using the industry-standard MPI protocol, and no specialised hardware is required. Using this approach, we have re-implemented the Inchworm module from the widely used Trinity pipeline, and tested the method in the context of the full Trinity pipeline. Validation tests on a range of real datasets show large reductions in the runtime and per-node memory requirements when making use of a compute cluster. Our study shows that MapReduce...
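The idea of clustering k-mers before assembly can be illustrated with a minimal single-process sketch. It is not the authors' implementation: the function names `map_kmers`, `reduce_counts` and `cluster_kmers` are hypothetical, the map and reduce steps run in plain Python rather than across MPI ranks, reverse complements are ignored, and clusters are taken to be the connected components of k-mers linked by (k-1)-base overlaps, which is an assumption about what a cluster means here.

```python
from collections import defaultdict


def map_kmers(read, k):
    """Map step: emit a (k-mer, 1) pair for every k-mer in one read."""
    return [(read[i:i + k], 1) for i in range(len(read) - k + 1)]


def reduce_counts(pairs):
    """Reduce step: sum the emitted counts for each distinct k-mer."""
    counts = defaultdict(int)
    for kmer, n in pairs:
        counts[kmer] += n
    return dict(counts)


def cluster_kmers(kmers):
    """Group k-mers into connected components (assumed cluster definition).

    Two k-mers are linked when the last k-1 bases of one equal the
    first k-1 bases of the other. Components are tracked with union-find.
    """
    kmers = list(kmers)
    parent = {km: km for km in kmers}

    def find(x):
        # Follow parent links to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Index k-mers by their (k-1)-base prefix for fast overlap lookup.
    by_prefix = defaultdict(list)
    for km in kmers:
        by_prefix[km[:-1]].append(km)

    # Union each k-mer with every k-mer that extends it by one base.
    for km in kmers:
        for nxt in by_prefix.get(km[1:], []):
            root_a, root_b = find(km), find(nxt)
            if root_a != root_b:
                parent[root_a] = root_b

    clusters = defaultdict(set)
    for km in kmers:
        clusters[find(km)].add(km)
    return list(clusters.values())


def kmer_clusters_from_reads(reads, k):
    """Run map, reduce and clustering in sequence on a list of reads."""
    pairs = [pair for read in reads for pair in map_kmers(read, k)]
    counts = reduce_counts(pairs)
    return counts, cluster_kmers(counts)
```

In a distributed setting, the map step would run independently on each node's share of the reads and the reduce step would gather pairs by key. Each resulting cluster could then be assembled separately, which is what keeps per-node memory small in this picture.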


Methods Mentioned

RNA-Seq

Software Mentioned

iDataplex
nextscale
Trans
CruzDB
kmer
simulate
Jellyfish
MapReduce
Velvet
REF
