Abstract

We describe an algorithm, SSAHA ( S equence S earch and A lignment by H ashing A lgorithm), for performing fast searches on databases containing multiple gigabases of DNA. Sequences in the database are preprocessed by breaking them into consecutive k -tuples of k contiguous bases and then using a hash table to store the position of each occurrence of each k -tuple. Searching for a query sequence in the database is done by obtaining from the hash table the “hits” for each k -tuple in the query sequence and then performing a sort on the results. We discuss the effect of the tuple length k on the search speed, memory usage, and sensitivity of the algorithm and present the results of computational experiments which show that SSAHA can be three to four orders of magnitude faster than BLAST or FASTA , while requiring less memory than suffix tree methods. The SSAHA algorithm is used for high-throughput single nucleotide polymorphism (SNP) detection and very large scale sequence assembly. Also, it provides Web-based sequence search facilities for Ensembl projects.

Keywords

Hash tableTupleHash functionSequence databaseComputer scienceEnsemblSequence (biology)Suffix arrayDatabase indexTable (database)Information retrievalData miningDatabaseAlgorithmBiologyData structureSearch engine indexingMathematicsGeneticsGenomeGenomicsGene

Affiliated Institutions

Related Publications

Publication Info

Year
2001
Type
article
Volume
11
Issue
10
Pages
1725-1729
Citations
962
Access
Closed

External Links

Social Impact

Social media, news, blog, policy document mentions

Citation Metrics

962
OpenAlex

Cite This

Zemin Ning, Anthony J. Cox, James C. Mullikin (2001). SSAHA: A Fast Search Method for Large DNA Databases. Genome Research , 11 (10) , 1725-1729. https://doi.org/10.1101/gr.194201

Identifiers

DOI
10.1101/gr.194201