De novo clustering of long reads by gene from transcriptomics data

Nucleic Acids Research
Camille Marchet, Pierre Peterlongo


Long-read sequencing currently provides sequences of several thousand base pairs. It is therefore possible to obtain complete transcripts, offering an unprecedented view of the cellular transcriptome. However, the literature lacks tools for de novo clustering of such data, in particular for Oxford Nanopore Technologies reads, because of their inherently high error rate compared to short reads. Our goal is to process reads from whole-transcriptome sequencing data accurately and without a reference genome, in order to reliably group reads coming from the same gene. This de novo approach is therefore particularly suitable for non-model species, but can also serve as a useful pre-processing step to improve read mapping. Our contribution is both a new algorithm adapted to clustering reads by gene and a practical, freely available tool that scales to the complete processing of eukaryotic transcriptomes. We sequenced a mouse RNA sample using the MinION device. This dataset is used to compare our solution to other algorithms used in the context of biological clustering. We demonstrate that it is the best of these approaches for transcriptomic long reads. When a reference is available to enable mapping, we show that it stands as an alter...



Related Concepts

Sequence Determinations, DNA
MRNA Differential Display
Mouse, Swiss
High-Throughput Nucleotide Sequencing
Gene Expression Profiles
Gene Clusters
