SSAHA: A Fast Search Method for Large DNA Databases

Zemin Ning; Anthony J. Cox; James C. Mullikin

doi:10.1101/gr.194201

Abstract

We describe an algorithm, SSAHA (Sequence Search and Alignment by Hashing Algorithm), for performing fast searches on databases containing multiple gigabases of DNA. Sequences in the database are preprocessed by breaking them into consecutive k-tuples of k contiguous bases and then using a hash table to store the position of each occurrence of each k-tuple. Searching for a query sequence in the database is done by obtaining from the hash table the "hits" for each k-tuple in the query sequence and then performing a sort on the results. We discuss the effect of the tuple length k on the search speed, memory usage, and sensitivity of the algorithm and present the results of computational experiments which show that SSAHA can be three to four orders of magnitude faster than BLAST or FASTA, while requiring less memory than suffix tree methods. The SSAHA algorithm is used for high-throughput single nucleotide polymorphism (SNP) detection and very large scale sequence assembly. It also provides Web-based sequence search facilities for Ensembl projects.
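The index-then-sort idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that database k-tuples are taken at non-overlapping offsets (0, k, 2k, ...), that the query is scanned with overlapping k-tuples, and that a match is a run of hits sharing the same sequence and the same shift (database offset minus query position). The function names `build_index` and `search` and the `min_hits` threshold are hypothetical, and a plain dictionary stands in for the hash table.

```python
from collections import defaultdict


def build_index(sequences, k):
    """Map each k-tuple to the (sequence index, offset) pairs where it occurs.

    Assumption: database tuples start at non-overlapping offsets 0, k, 2k, ...
    """
    index = defaultdict(list)
    for seq_id, seq in enumerate(sequences):
        for offset in range(0, len(seq) - k + 1, k):
            index[seq[offset:offset + k]].append((seq_id, offset))
    return index


def search(index, query, k, min_hits=2):
    """Collect hits for every overlapping k-tuple of the query, sort them,
    and report runs of hits that share the same sequence and shift.

    Returns a list of (sequence index, shift, number of hits) tuples.
    """
    hits = []
    for q_pos in range(len(query) - k + 1):
        for seq_id, offset in index.get(query[q_pos:q_pos + k], ()):
            # shift = where the query would start in the database sequence
            hits.append((seq_id, offset - q_pos, offset))

    # Sorting groups hits from the same sequence and shift next to each other.
    hits.sort()

    matches = []
    run_start = 0
    for i in range(1, len(hits) + 1):
        # Close the current run at the end of the list or when (seq, shift) changes.
        if i == len(hits) or hits[i][:2] != hits[run_start][:2]:
            if i - run_start >= min_hits:
                seq_id, shift, _ = hits[run_start]
                matches.append((seq_id, shift, i - run_start))
            run_start = i
    return matches


if __name__ == "__main__":
    database = ["ACGTTGCAGGATCCAA", "TTTTGGGGCCCCAAAA"]
    index = build_index(database, k=4)
    print(search(index, "TGCAGGATCC", k=4))
```

In this toy setting a larger k gives fewer, more specific hits per tuple, while a smaller k finds shorter matches at the cost of more hits to sort, which is the speed and sensitivity trade-off the abstract refers to.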

Keywords

Hash table, Tuple, Hash function, Sequence database, Computer science, Ensembl, Sequence (biology), Suffix array, Database index, Table (database), Information retrieval, Data mining, Database, Algorithm, Biology, Data structure, Search engine indexing, Mathematics, Genetics, Genome, Genomics, Gene

Affiliated Institutions

Wellcome Sanger Institute GB

Related Publications

A fast, lock-free approach for efficient parallel counting of occurrences of k-mers

Guillaume Marçais, Carl Kingsford

Abstract Motivation: Counting the number of occurrences of every k-mer (substring of length k) in a long string is a central subproblem in many applications, including genome as...

2011 Bioinformatics 4605 citations

Similarity Search in High Dimensions via Hashing

Aristides Gionis, Piotr Indyk, Rajeev Motwani

The nearest- or near-neighbor query problems arise in a large variety of database applications, usually in the context of similarity searching. Of late, there has been increasin...

1999 3096 citations

MMseqs software suite for fast and deep clustering and searching of large protein sequence sets

Maria Hauser, Martin Steinegger, Johannes Söding

Abstract Motivation: Sequence databases are growing fast, challenging existing analysis pipelines. Reducing the redundancy of sequence databases by similarity clustering improve...

2016 Bioinformatics 276 citations

GHOSTX: An Improved Sequence Homology Search Algorithm Using a Query Suffix Array and a Database Suffix Array

Shuji Suzuki, Masanori Kakuta, Takashi Ishida +1 more

DNA sequences are translated into protein coding sequences and then further assigned to protein families in metagenomic analyses, because of the need for sensitivity. However, h...

2014 PLoS ONE 91 citations

UniProt archive

Rasko Leinonen, Federico Garcia Diez, David Binns +3 more

Abstract Summary: UniProt Archive (UniParc) is the most comprehensive, non-redundant protein sequence database available. Its protein sequences are retrieved from predominant, p...

2004 Bioinformatics 209 citations

Publication Info

Year: 2001
Type: article
Volume: 11
Issue: 10
Pages: 1725-1729
Citations: 962
Access: Closed

Cite This

APA Style

Zemin Ning, Anthony J. Cox, James C. Mullikin (2001). SSAHA: A Fast Search Method for Large DNA Databases. Genome Research, 11(10), 1725-1729. https://doi.org/10.1101/gr.194201
