Assembly of long, error-prone reads using repeat graphs

Abstract

Accurate genome assembly is hampered by repetitive regions. Although long single molecule sequencing reads are better able to resolve genomic repeats than short-read data, most long-read assembly algorithms do not provide the repeat characterization necessary for producing optimal assemblies. Here, we present Flye, a long-read assembly algorithm that generates arbitrary paths in an unknown repeat graph, called disjointigs, and constructs an accurate repeat graph from these error-riddled disjointigs. We benchmark Flye against five state-of-the-art assemblers and show that it generates better or comparable assemblies, while being an order of magnitude faster. Flye nearly doubled the contiguity of the human genome assembly (as measured by the NGA50 assembly quality metric) compared with existing assemblers. Flye improves the speed and accuracy of genome assembly by using repeat graphs to resolve repeat regions.
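The NGA50 contiguity metric mentioned in the abstract can be illustrated with a minimal sketch. It assumes the commonly used definition: contigs are first split into aligned blocks at misassembly breakpoints, and NGA50 is the length of the block at which the blocks, taken from longest to shortest, first cover at least half of the reference genome length. The function name `ngx` and the toy block lengths are hypothetical, chosen only for illustration. This is not the code used by Flye or by any particular evaluation tool.

```python
from typing import Iterable


def ngx(block_lengths: Iterable[int], genome_length: int, fraction: float = 0.5) -> int:
    """Return the length L of the block at which blocks, taken from longest
    to shortest, first cover at least `fraction` of `genome_length`.

    With aligned-block lengths as input and fraction=0.5 this corresponds
    to NGA50; with raw contig lengths it corresponds to NG50.
    Returns 0 if the blocks never reach the target coverage.
    """
    if genome_length <= 0:
        raise ValueError("genome_length must be positive")
    target = fraction * genome_length
    covered = 0
    # Walk blocks from longest to shortest, accumulating covered bases.
    for length in sorted(block_lengths, reverse=True):
        covered += length
        if covered >= target:
            return length
    return 0


# Toy example (hypothetical numbers): four aligned blocks, 1200 bp reference.
aligned_blocks = [400, 300, 200, 100]
reference_length = 1200
print(ngx(aligned_blocks, reference_length))
```

Because the target is defined against the reference length rather than the assembly length, an assembly that covers less than half of the reference gets a value of 0 under this sketch, which is why the metric penalizes both fragmented and incomplete assemblies.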

Keywords

Sequence assembly, Contiguity, Computer science, Hybrid genome assembly, Graph, Metric (unit), Algorithm, Benchmark (surveying), Computational biology, Genome, Human genome, Theoretical computer science, Biology, Genetics, Gene, Engineering

MeSH Terms

Algorithms; Genome, Bacterial; Genome, Human; Genomics; High-Throughput Nucleotide Sequencing; Humans; Molecular Sequence Annotation; Repetitive Sequences, Nucleic Acid; Sequence Analysis, DNA; Software

Related Publications

IDBA-UD: a de novo assembler for single-cell and metagenomic sequencing data with highly uneven depth

Yu Peng, Henry C. M. Leung, Siu‐Ming Yiu +1 more

Abstract Motivation: Next-generation sequencing allows us to sequence reads from a microbial environment using single-cell sequencing or metagenomic sequencing technologies. How...

2012 Bioinformatics 3099 citations

Whole-Genome Sequencing and Assembly with High-Throughput, Short-Read Technologies

Andreas Sundquist, Mostafa Ronaghi, Haixu Tang +2 more

While recently developed short-read sequencing technologies may dramatically reduce the sequencing cost and eventually achieve the $1000 goal for re-sequencing, their limitation...

2007 PLoS ONE 126 citations

GAGE: A critical evaluation of genome assemblies and assembly algorithms

Steven L. Salzberg, Adam M. Phillippy, Aleksey V. Zimin +10 more

New sequencing technology has dramatically altered the landscape of whole-genome sequencing, allowing scientists to initiate numerous projects to decode the genomes of previousl...

2011 Genome Research 733 citations

SPAdes: A New Genome Assembly Algorithm and Its Applications to Single-Cell Sequencing

Anton Bankevich, Sergey Nurk, Dmitry Antipov +13 more

The lion's share of bacteria in various environments cannot be cloned in the laboratory and thus cannot be sequenced using existing technologies. A major goal of single-cell gen...

2012 Journal of Computational Biology 25356 citations

Velvet: Algorithms for de novo short read assembly using de Bruijn graphs

Daniel R. Zerbino, Ewan Birney

We have developed a new set of algorithms, collectively called “Velvet,” to manipulate de Bruijn graphs for genomic sequence assembly. A de Bruijn graph is a compact representat...

2008 Genome Research 9539 citations

Publication Info

Year: 2019
Type: article
Volume: 37
Issue: 5
Pages: 540-546
Citations: 5451
Access: Closed

Citation Metrics

OpenAlex: 5451
Influential: 800

Cite This

APA Style

Mikhail Kolmogorov, Jeffrey Yuan, Yu Lin et al. (2019). Assembly of long, error-prone reads using repeat graphs. Nature Biotechnology, 37(5), 540-546. https://doi.org/10.1038/s41587-019-0072-8

Identifiers

DOI: 10.1038/s41587-019-0072-8
PMID: 30936562
