Abstract
Abstract Motivation: Comparison of multimegabase genomic DNA sequences is a popular technique for finding and annotating conserved genome features. Performing such comparisons entails finding many short local alignments between sequences up to tens of megabases in length. To process such long sequences efficiently, existing algorithms find alignments by expanding around short runs of matching bases with no substitutions or other differences. Unfortunately, exact matches that are short enough to occur often in significant alignments also occur frequently by chance in the background sequence. Thus, these algorithms must trade off between efficiency and sensitivity to features without long exact matches. Results: We introduce a new algorithm, lsh-all-pairs, to find ungapped local alignments in genomic sequence with up to a specified fraction of substitutions. The length and substitution rate of these alignments can be chosen so that they appear frequently in significant similarities yet still remain rare in the background sequence. The algorithm finds ungapped alignments efficiently using a randomized search technique, locality-sensitive hashing. We have found lsh-all-pairsto be both efficient and sensitive for finding local similarities with as little as 63% identity in mammalian genomic sequences up to tens of megabases in length Availability: Contact the author at the address below. Contact: jbuhler@cs.washington.edu Supplementary information: The sequences and local alignment data described in this work are available at http://bio.cs.washington.edu/jbuhler-bioinformatics-2001/.
Keywords
Affiliated Institutions
Related Publications
Fast and accurate long-read alignment with Burrows–Wheeler transform
Abstract Motivation: Many programs for aligning short sequencing reads to a reference genome have been developed in the last 2 years. Most of them are very efficient for short r...
Aligning two sequences within a specified diagonal band
We describe an algorithm for aligning two sequences within a diagonal band that requires only O(NW) computation time and O(N) space, where N is the length of the shorter of the ...
ParAlign: a parallel sequence alignment algorithm for rapid and sensitive database searches
There is a need for faster and more sensitive algorithms for sequence similarity searching in view of the rapidly increasing amounts of genomic sequence data available. Parallel...
Automated assembly of protein blocks for database searching
A system is described for finding and assembling the most highly conserved regions of related proteins for database searching. First, an automated version of Smith's algorithm f...
The Multiple Sequence Alignment Problem in Biology
The study and comparison of sequences of characters from a finite alphabet is relevant to various areas of science, notably molecular biology. The measurement of sequence simila...
Publication Info
- Year
- 2001
- Type
- article
- Volume
- 17
- Issue
- 5
- Pages
- 419-428
- Citations
- 261
- Access
- Closed
External Links
Social Impact
Social media, news, blog, policy document mentions
Citation Metrics
Cite This
Identifiers
- DOI
- 10.1093/bioinformatics/17.5.419