Abstract

Abstract We present a concept and formalism, the string graph, which represents all that is inferable about a DNA sequence from a collection of shotgun sequencing reads collected from it. We give time and space efficient algorithms for constructing a string graph given the collection of overlaps between the reads and, in particular, present a novel linear expected time algorithm for transitive reduction in this context. The result demonstrates that the decomposition of reads into kmers employed in the de Bruijn graph approach described earlier is not essential, and exposes its close connection to the unitig approach we developed at Celera. This paper is a preliminary piece giving the basic algorithm and results that demonstrate the efficiency and scalability of the method. These ideas are being used to build a next-generation whole genome assembler called BOA (Berkeley Open Assembler) that will easily scale to mammalian genomes. Contact: gene@eecs.berkeley.edu

Keywords

Fragment (logic)Computer scienceGraphString (physics)Theoretical computer scienceComputational biologyProgramming languageBiologyPhysicsTheoretical physics

Affiliated Institutions

Related Publications

Publication Info

Year
2005
Type
article
Volume
21
Issue
suppl_2
Pages
ii79-ii85
Citations
431
Access
Closed

External Links

Social Impact

Altmetric

Social media, news, blog, policy document mentions

Citation Metrics

431
OpenAlex

Cite This

Eugene W. Myers (2005). The fragment assembly string graph. Bioinformatics , 21 (suppl_2) , ii79-ii85. https://doi.org/10.1093/bioinformatics/bti1114

Identifiers

DOI
10.1093/bioinformatics/bti1114