MetaCAA: A clustering-aided methodology for efficient assembly of metagenomic datasets

Rachamalla Maheedhar Reddy, Sharmila S Mande


A key challenge in analyzing metagenomic data is the assembly of sequenced DNA fragments (i.e. reads) originating from the various microbes in a given environmental sample. Several existing methodologies can assemble reads originating from a single genome. However, these methodologies cannot be applied efficiently to the assembly of metagenomic sequence datasets. In this study, we present MetaCAA, a clustering-aided methodology that helps improve the quality of metagenomic sequence assembly. MetaCAA first groups the sequences constituting a given metagenome into smaller clusters. The sequences in each cluster are then assembled independently using CAP3, an existing single-genome assembly program. The contigs formed in each cluster, along with the unassembled reads, are then subjected to another round of assembly to generate the final set of contigs. Validation using simulated and real-world metagenomic datasets indicates that MetaCAA aids in improving the overall quality of assembly. A software implementation of MetaCAA is available at
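The three stages described in the abstract (cluster, assemble each cluster, reassemble contigs together with leftover reads) can be illustrated with a minimal Python sketch. This is not the MetaCAA implementation. The abstract does not say which features drive the clustering, so the GC-content split in `gc_cluster` is a hypothetical stand-in, and `toy_assemble` is a naive exact-overlap merger used in place of CAP3 so that the example runs without external tools. All function names are illustrative, as is the choice to keep first-round contigs that do not merge further as final contigs.

```python
from typing import Callable, List, Tuple

Seqs = List[str]
# An assembler maps sequences to (contigs, unassembled sequences).
Assembler = Callable[[Seqs], Tuple[Seqs, Seqs]]
Clusterer = Callable[[Seqs], List[Seqs]]


def gc_cluster(reads: Seqs, threshold: float = 0.5) -> List[Seqs]:
    """Hypothetical clustering: split reads into low-GC and high-GC groups."""
    low, high = [], []
    for read in reads:
        gc = (read.count("G") + read.count("C")) / len(read)
        (high if gc >= threshold else low).append(read)
    return [group for group in (low, high) if group]


def overlap(a: str, b: str, min_len: int) -> int:
    """Length of the longest suffix of a that equals a prefix of b (0 if < min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def toy_assemble(seqs: Seqs, min_overlap: int = 4) -> Tuple[Seqs, Seqs]:
    """Toy stand-in for a single-genome assembler: greedy exact-overlap merging."""
    pool = [(s, False) for s in seqs]  # (sequence, produced by a merge?)
    merged = True
    while merged:
        merged = False
        for i in range(len(pool)):
            for j in range(len(pool)):
                if i == j:
                    continue
                k = overlap(pool[i][0], pool[j][0], min_overlap)
                if k:
                    joined = pool[i][0] + pool[j][0][k:]
                    pool = [p for n, p in enumerate(pool) if n not in (i, j)]
                    pool.append((joined, True))
                    merged = True
                    break
            if merged:
                break
    contigs = [s for s, was_merged in pool if was_merged]
    singlets = [s for s, was_merged in pool if not was_merged]
    return contigs, singlets


def two_round_assembly(reads: Seqs, cluster: Clusterer,
                       assemble: Assembler) -> Tuple[Seqs, Seqs]:
    """Cluster reads, assemble each cluster, then reassemble contigs plus leftovers."""
    round1_contigs, leftovers = [], []
    for group in cluster(reads):
        contigs, unassembled = assemble(group)
        round1_contigs.extend(contigs)
        leftovers.extend(unassembled)
    final_contigs, rest = assemble(round1_contigs + leftovers)
    # Assumption: first-round contigs that did not merge further stay contigs.
    first_round = set(round1_contigs)
    final_contigs = final_contigs + [s for s in rest if s in first_round]
    singlets = [s for s in rest if s not in first_round]
    return final_contigs, singlets


# Two made-up "genomes" (AT-rich and GC-rich), each covered by two overlapping
# reads, plus one unrelated read that should remain unassembled.
genome_a = "ATTTAGATAATTCAAT"
genome_b = "GCCGGTCCGCGGACGC"
reads = [genome_a[:10], genome_b[:10], genome_a[4:], genome_b[4:], "CACACACACA"]

final_contigs, singlets = two_round_assembly(reads, gc_cluster, toy_assemble)
print(final_contigs, singlets)
```

Because the assembler is passed in as a callable, a wrapper that runs an external assembler such as CAP3 and reads back its contig and singlet output could be substituted for `toy_assemble` without changing `two_round_assembly`.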



