Abstract
Abstract We present a concept and formalism, the string graph, which represents all that is inferable about a DNA sequence from a collection of shotgun sequencing reads collected from it. We give time and space efficient algorithms for constructing a string graph given the collection of overlaps between the reads and, in particular, present a novel linear expected time algorithm for transitive reduction in this context. The result demonstrates that the decomposition of reads into kmers employed in the de Bruijn graph approach described earlier is not essential, and exposes its close connection to the unitig approach we developed at Celera. This paper is a preliminary piece giving the basic algorithm and results that demonstrate the efficiency and scalability of the method. These ideas are being used to build a next-generation whole genome assembler called BOA (Berkeley Open Assembler) that will easily scale to mammalian genomes. Contact: gene@eecs.berkeley.edu
Keywords
Affiliated Institutions
Related Publications
An Eulerian path approach to DNA fragment assembly
For the last 20 years, fragment assembly in DNA sequencing followed the “overlap–layout–consensus” paradigm that is used in all currently available assembly tools. Although this...
A parallel graph decomposition algorithm for DNA sequencing with nanopores
Abstract Motivation: With the potential availability of nanopore devices that can sense the bases of translocating single-stranded DNA (ssDNA), it is likely that ‘reads’ of leng...
Velvet: Algorithms for de novo short read assembly using de Bruijn graphs
We have developed a new set of algorithms, collectively called “Velvet,” to manipulate de Bruijn graphs for genomic sequence assembly. A de Bruijn graph is a compact representat...
IDBA-UD: a <i>de novo</i> assembler for single-cell and metagenomic sequencing data with highly uneven depth
Abstract Motivation: Next-generation sequencing allows us to sequence reads from a microbial environment using single-cell sequencing or metagenomic sequencing technologies. How...
Colored de Bruijn Graphs and the Genome Halving Problem
Breakpoint graph analysis is a key algorithmic technique in studies of genome rearrangements. However, breakpoint graphs are defined only for genomes without duplicated genes, t...
Publication Info
- Year
- 2005
- Type
- article
- Volume
- 21
- Issue
- suppl_2
- Pages
- ii79-ii85
- Citations
- 431
- Access
- Closed
External Links
Social Impact
Social media, news, blog, policy document mentions
Citation Metrics
Cite This
Identifiers
- DOI
- 10.1093/bioinformatics/bti1114