Similarity Search in High Dimensions via Hashing
Aristides Gionis, Piotr Indyk, Rajeev Motwani
Presented by:
Fatih Uzun
Outline
• Introduction
• Problem Description
• Key Idea
• Experiments and Results
• Conclusions
Introduction
• Similarity Search over High-Dimensional Data
– Image databases, document collections, etc.
• Curse of Dimensionality
– All space partitioning techniques degrade to linear
search for high dimensions
• Exact vs. Approximate Answer
– Approximate might be good enough and much faster
– Time-quality trade-off
Problem Description
• ε-Nearest Neighbor Search (ε-NNS)
– Given a set P of points in a normed space, preprocess P so as to efficiently return a point p ∈ P for any given query point q, such that
• dist(q,p) ≤ (1 + ε) · min r ∈ P dist(q,r)
• Generalizes to K-nearest neighbor search (K > 1)
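The (1 + ε) guarantee above can be made concrete with a brute-force check (a hypothetical sketch; the function names and sample points are illustrative, not from the slides):

```python
# Sketch: what the eps-NNS guarantee means. A returned point p is a valid
# (1 + eps)-approximate nearest neighbor of q if its distance to q is
# within a (1 + eps) factor of the exact nearest-neighbor distance.
import math

def dist(a, b):
    # Euclidean distance between two points
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def is_valid_answer(P, q, p, eps):
    exact = min(dist(q, r) for r in P)      # true nearest-neighbor distance
    return dist(q, p) <= (1 + eps) * exact

P = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]
q = (0.9, 0.1)
# (1.0, 0.0) is the exact nearest neighbor, so it is trivially valid:
assert is_valid_answer(P, q, (1.0, 0.0), eps=0.0)
```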
Key Idea
• Locality Sensitive Hashing ( LSH ) to get
sub-linear dependence on the data-size for
high-dimensional data
• Preprocessing :
– Hash each data point using several LSH functions so that the probability of collision is higher for closer objects
Algorithm : Preprocessing
• Input
– Set of N points { p1, …, pN }
– L ( number of hash tables )
• Output
– Hash tables Ti , i = 1, 2, …, L
• Foreach i = 1, 2, …, L
– Initialize Ti with a random hash function gi(·)
• Foreach i = 1, 2, …, L
– Foreach j = 1, 2, …, N
• Store point pj in bucket gi(pj) of hash table Ti
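The preprocessing loop above can be sketched as follows; the slides leave the hash family unspecified, so bit-sampling over binary vectors is assumed here purely for concreteness:

```python
# Minimal sketch of LSH preprocessing, assuming bit-sampling hash functions
# over binary vectors (the hash family is an assumption of this sketch).
import random
from collections import defaultdict

def make_g(dim, k, rng):
    # g samples k random coordinates; its value is the tuple of those bits
    coords = [rng.randrange(dim) for _ in range(k)]
    return lambda p: tuple(p[c] for c in coords)

def preprocess(points, L, k, seed=0):
    rng = random.Random(seed)
    tables = []
    for _ in range(L):                      # for each of the L tables
        g = make_g(len(points[0]), k, rng)  # random hash function g_i
        T = defaultdict(list)
        for j, p in enumerate(points):      # store p_j in bucket g_i(p_j)
            T[g(p)].append(j)
        tables.append((g, T))
    return tables

points = [(0, 1, 1, 0), (0, 1, 1, 1), (1, 0, 0, 0)]
tables = preprocess(points, L=4, k=2)
```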
LSH - Algorithm
[Figure: each point pi ∈ P is hashed by g1, g2, …, gL into buckets of the hash tables T1, T2, …, TL]
Algorithm : ε-NNS Query
• Input
– Query point q
– K ( number of approx. nearest neighbors )
• Access
– Hash tables Ti , i = 1, 2, …, L
• Output
– Set S of K ( or fewer ) approx. nearest neighbors
• S ← ∅
• Foreach i = 1, 2, …, L
– S ← S ∪ { points found in bucket gi(q) of hash table Ti }
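Putting the two phases together, a self-contained sketch of the query step (the bit-sampling hash family, and all names here, are assumptions of this sketch, not from the paper):

```python
# Minimal sketch of the eps-NNS query: union the buckets g_i(q) over all
# L tables, then rank the candidates by true distance to q.
import random
from collections import defaultdict

def make_g(dim, k, rng):
    coords = [rng.randrange(dim) for _ in range(k)]  # bit-sampling hash
    return lambda p: tuple(p[c] for c in coords)

def build(points, L, k, seed=0):
    rng = random.Random(seed)
    tables = []
    for _ in range(L):
        g = make_g(len(points[0]), k, rng)
        T = defaultdict(list)
        for j, p in enumerate(points):
            T[g(p)].append(j)
        tables.append((g, T))
    return tables

def query(points, tables, q, K):
    S = set()                          # S <- empty set
    for g, T in tables:                # for each table T_i
        S.update(T.get(g(q), []))      # S <- S union bucket g_i(q)
    # rank candidates by true (squared) distance, return up to K of them
    return sorted(S, key=lambda j: sum((a - b) ** 2
                                       for a, b in zip(points[j], q)))[:K]

points = [(0, 1, 1, 0), (0, 1, 1, 1), (1, 0, 0, 0)]
tables = build(points, L=4, k=2)
neighbors = query(points, tables, (0, 1, 1, 0), K=2)
```

Note that fewer than K candidates may fall in the probed buckets, which is exactly the "miss" case measured later in the experiments.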
LSH - Analysis
• Family H of (r1, r2, p1, p2)-sensitive functions { hi(·) }
– dist(p,q) < r1 ⇒ ProbH [h(q) = h(p)] ≥ p1
– dist(p,q) ≥ r2 ⇒ ProbH [h(q) = h(p)] ≤ p2
– p1 > p2 and r1 < r2
• LSH functions: gi(·) = ( h1(·), …, hk(·) )
• For a proper choice of k and L, a simpler problem, (r, ε)-Neighbor, and hence the actual problem, can be solved
• Query Time : O( d · n^(1/(1+ε)) )
– d : dimensions , n : data size
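A quick numeric check (with assumed values of p1, p2, k, and L, not taken from the slides) shows how concatenating k hashes and repeating over L tables separates close pairs from far pairs:

```python
# With g = (h_1, ..., h_k), a pair that collides per-hash with probability
# p collides under g with probability p**k, and collides in at least one
# of L independent tables with probability 1 - (1 - p**k)**L.
def hit_prob(p, k, L):
    return 1 - (1 - p ** k) ** L

p1, p2, k, L = 0.9, 0.5, 10, 30   # assumed sensitivity gap and parameters
close = hit_prob(p1, k, L)        # close pairs: found with high probability
far = hit_prob(p2, k, L)          # far pairs: rarely collide
assert close > 0.99 and far < 0.03
```

Concatenation drives both probabilities down, but the larger p1 decays far more slowly than p2, and the L repetitions then recover the close-pair probability without recovering the far-pair one.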
Experiments
• Data Sets
– Color images from COREL Draw library
(20,000 points, dimensions up to 64)
– Texture information of aerial photographs
(270,000 points, dimensions 60)
• Evaluation
– Speed, Miss Ratio, Error (%) for various data sizes,
dimensions, and K values
– Compare Performance with SR-Tree ( Spatial Data
Structure )
Performance Measures
• Speed
– Number of disk block accesses needed to answer the query ( proportional to the number of hash tables )
• Miss Ratio
– Fraction of cases when less than K points are found for
K-NNS
• Error
– Average fractional error in the distance to the point found by LSH, relative to the true nearest-neighbor distance, taken over the entire set of queries
Speed vs. Data Size
[Figure: Approximate 1-NNS — disk accesses vs. number of database points (up to 20,000), for LSH at error 0.2, 0.1, 0.05, 0.02, and for SR-Tree]
Speed vs. Dimension
[Figure: Approximate 1-NNS — disk accesses vs. dimensions (up to 80), for LSH at error 0.2, 0.1, 0.05, 0.02, and for SR-Tree]
Speed vs. Nearest Neighbors
[Figure: Approximate K-NNS — disk accesses vs. number of nearest neighbors K (up to 120), for LSH at error 0.2, 0.1, 0.05]
Speed vs. Error
[Figure: disk accesses vs. error (%) from 10 to 50, for SR-Tree and LSH]
Miss Ratio vs. Data Size
[Figure: Approximate 1-NNS — miss ratio vs. number of database points (up to 20,000), for error 0.1 and 0.05]
Conclusion
• Better query time than spatial data structures
• Scales well to higher dimensions and larger data sizes ( sub-linear dependence )
• Predictable running time
• Extra storage overhead
• Inefficient for data with distances concentrated around the average
Future Work
• Investigate hybrid data structures obtained by merging tree-based and hash-based structures
• Make use of the structure of the data set to systematically obtain LSH functions
• Explore other applications of LSH-type techniques to data mining
Questions?