Similarity Search in High Dimensions via Hashing

Abstract

The nearest- or near-neighbor query problems arise in a large variety of database applications, usually in the context of similarity searching. Of late, there has been increasing interest in building search/index structures for performing similarity search over high-dimensional data, e.g., image databases, document collections, time-series databases, and genome databases. Unfortunately, all known techniques for solving this problem fall prey to the "curse of dimensionality." That is, the data structures scale poorly with data dimensionality; in fact, if the number of dimensions exceeds 10 to 20, searching in k-d trees and related structures involves the inspection of a large fraction of the database, thereby doing no better than brute-force linear search. It has been suggested that since the selection of features and the choice of a distance metric in typical applications is rather heuristic, determining an approximate nearest neighbor should suffice for most practic...
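The contrast the abstract draws between brute-force linear search and approximate nearest-neighbor search can be illustrated with a small sketch. The abstract is truncated before it describes the paper's own construction, so the code below is only an assumption-laden illustration: it uses a generic bit-sampling locality-sensitive hash over binary vectors under Hamming distance, and the names `hamming`, `linear_scan`, `BitSamplingIndex`, `num_tables`, and `bits_per_key` are hypothetical, not taken from the paper.

```python
import random


def hamming(a, b):
    """Number of positions where two equal-length bit tuples differ."""
    return sum(x != y for x, y in zip(a, b))


def linear_scan(points, query):
    """Exact nearest neighbor by brute force: inspects every stored point."""
    return min(range(len(points)), key=lambda i: hamming(points[i], query))


class BitSamplingIndex:
    """Toy LSH index (illustrative assumption, not the paper's code).

    Each table keys a vector by the values of a few randomly sampled bit
    positions, so vectors that differ in few bits tend to share a bucket.
    """

    def __init__(self, points, num_tables=10, bits_per_key=8, seed=0):
        rng = random.Random(seed)
        dim = len(points[0])
        self.points = points
        # One random subset of bit positions per hash table.
        self.samples = [rng.sample(range(dim), bits_per_key)
                        for _ in range(num_tables)]
        self.tables = [{} for _ in range(num_tables)]
        for idx, p in enumerate(points):
            for table, positions in zip(self.tables, self.samples):
                key = tuple(p[j] for j in positions)
                table.setdefault(key, []).append(idx)

    def query(self, q):
        """Return the index of the closest point among colliding buckets,
        or None if the query collides with nothing in any table."""
        candidates = set()
        for table, positions in zip(self.tables, self.samples):
            key = tuple(q[j] for j in positions)
            candidates.update(table.get(key, []))
        if not candidates:
            return None
        return min(candidates, key=lambda i: hamming(self.points[i], q))


if __name__ == "__main__":
    rng = random.Random(1)
    data = [tuple(rng.randint(0, 1) for _ in range(64)) for _ in range(1000)]
    q = list(data[123])
    q[0] ^= 1  # perturb one bit so the query is near, not equal to, data[123]
    q = tuple(q)
    index = BitSamplingIndex(data)
    print("exact:", linear_scan(data, q), "approximate:", index.query(q))
```

The linear scan always touches every point, while the index only compares the query against points that share a bucket with it. The price is that the answer is approximate: a true nearest neighbor can be missed if it collides with the query in none of the tables, which is why more tables or shorter keys trade extra work for higher recall.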

Keywords

Nearest neighbor search, Computer science, Hash function, Context (archaeology), Similarity (geometry), Information retrieval, Data mining, Locality-sensitive hashing, Curse of dimensionality, Theoretical computer science, Database, Hash table, Artificial intelligence

Affiliated Institutions

Stanford University US

Related Publications

A Global Geometric Framework for Nonlinear Dimensionality Reduction

Joshua B. Tenenbaum, Vin de Silva, John Langford

Scientists working with large volumes of high-dimensional data, such as global climate patterns, stellar spectra, or human gene distributions, regularly confront the problem of ...

2000 Science 13453 citations

Example-based super-resolution

William T. Freeman, Thouis R. Jones, Egon Pasztor

We call methods for achieving high-resolution enlargements of pixel-based images super-resolution algorithms. Many applications in graphics or image processing could benefit fro...

2002 IEEE Computer Graphics and Applications 2502 citations

New Powder Diffraction File (PDF-4) in relational database format: advantages and data-mining capabilities

S. Kabekkodu, J. Faber, Tim Fawcett

The International Centre for Diffraction Data (ICDD) is responding to the changing needs in powder diffraction and materials analysis by developing the Powder Diffraction File (...

2002 Acta Crystallographica Section B Stru... 99 citations

Gapped BLAST and PSI-BLAST: a new generation of protein database search programs

Stephen F. Altschul

The BLAST programs are widely used tools for searching protein and DNA databases for sequence similarities. For protein comparisons, a variety of definitional, algorithmic and s...

1997 Nucleic Acids Research 73388 citations

Dimensionality reduction for visualizing single-cell data using UMAP

Étienne Becht, Leland McInnes, John Healy +5 more

Advances in single-cell technologies have enabled high-resolution dissection of tissue composition. Several tools for dimensionality reduction are available to analyze the large...

2018 Nature Biotechnology 5349 citations

Publication Info

Year: 1999
Type: article
Pages: 518-529
Citations: 3096
Access: Closed

Citation Metrics

3096 (OpenAlex)

Cite This

APA Style

Aristides Gionis, Piotr Indyk, Rajeev Motwani (1999). Similarity Search in High Dimensions via Hashing, 518-529.