Abstract

Accurate genome assembly is hampered by repetitive regions. Although long single-molecule sequencing reads are better able to resolve genomic repeats than short-read data, most long-read assembly algorithms do not provide the repeat characterization necessary for producing optimal assemblies. Here, we present Flye, a long-read assembly algorithm that generates arbitrary paths, called disjointigs, in an unknown repeat graph and constructs an accurate repeat graph from these error-riddled disjointigs. We benchmark Flye against five state-of-the-art assemblers and show that it generates better or comparable assemblies, while being an order of magnitude faster. Flye nearly doubled the contiguity of the human genome assembly (as measured by the NGA50 assembly quality metric) compared with existing assemblers. Flye improves the speed and accuracy of genome assembly by using repeat graphs to resolve repeat regions.
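
The abstract reports contiguity as NGA50 without defining it. As a minimal sketch, assuming the commonly used definition (the length of the shortest aligned block such that blocks of that length or longer cover at least half of the reference genome), the metric could be computed as follows. The function name `nga50` and its parameters are illustrative and not taken from the paper, and the sketch assumes the aligned block lengths have already been obtained by aligning contigs to a reference and splitting them at misassemblies.

```python
from typing import Iterable, Optional


def nga50(aligned_block_lengths: Iterable[int], reference_length: int) -> Optional[int]:
    """Return an NGA50-style value, or None if the blocks cover < 50% of the reference.

    Assumption: `aligned_block_lengths` holds the lengths of contig pieces that
    align to the reference after contigs were broken at misassemblies.
    """
    if reference_length <= 0:
        raise ValueError("reference_length must be positive")

    covered = 0
    # Walk the blocks from longest to shortest, accumulating covered bases.
    for length in sorted(aligned_block_lengths, reverse=True):
        covered += length
        # Stop at the first block that pushes coverage to at least half the reference.
        if 2 * covered >= reference_length:
            return length
    return None


if __name__ == "__main__":
    # Toy numbers for illustration only, not data from the paper.
    print(nga50([50, 40, 30, 20, 10], reference_length=200))
```

Because the threshold is tied to the reference length rather than the total assembly length, an assembly that covers less than half of the reference has no defined value under this sketch, which is why the function returns `None` in that case.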

Keywords

Sequence assembly, Contiguity, Computer science, Hybrid genome assembly, Graph, Metric (unit), Algorithm, Benchmark (surveying), Computational biology, Genome, Human genome, Theoretical computer science, Biology, Genetics, Gene, Engineering

MeSH Terms

Algorithms; Genome, Bacterial; Genome, Human; Genomics; High-Throughput Nucleotide Sequencing; Humans; Molecular Sequence Annotation; Repetitive Sequences, Nucleic Acid; Sequence Analysis, DNA; Software

Publication Info

Year: 2019
Type: article
Volume: 37
Issue: 5
Pages: 540-546
Citations: 5451
Access: Closed

Citation Metrics

OpenAlex: 5451
Influential: 800

Cite This

Mikhail Kolmogorov, Jeffrey Yuan, Yu Lin et al. (2019). Assembly of long, error-prone reads using repeat graphs. Nature Biotechnology, 37(5), 540-546. https://doi.org/10.1038/s41587-019-0072-8

Identifiers

DOI: 10.1038/s41587-019-0072-8
PMID: 30936562
