the simhash algorithm(外文经典)

why?

shingling gives great results but the O(n2) runtime is poor
a set of 1e6 records would require 5e11 comparisons and even the cpp impl can "only" do 5e6 /sec
that's 2 months of runtime, 1.999 months too long in my mind.

we need an heuristic that can help us avoid all that brute forceness.....

what?

let's try making use of the hash algorithm simhash.
simhash was developed by Moses Charikar and is described in his paper
similarity estimation techniques from rounding algorithms.pdf

so what is so special about simhash?

a hash function usually hashes different values to totally different hash values

irb(main):006:0> p1 = 'the cat sat on the mat'
irb(main):005:0> p2 = 'the cat sat on a mat'
irb(main):007:0> p3 = 'we all scream for ice cream'
irb(main):007:0> p1.hash
=> 415542861
irb(main):007:0> p2.hash
=> 668720516
irb(main):007:0> p3.hash
=> 767429688

simhash is one where similiar items are hashed to similiar hash values
(by similar we mean the bitwise hamming distance between hash values)

irb(main):003:0> p1.simhash
=> 851459198
00110010110000000011110001111110

irb(main):004:0> p2.simhash
=> 847263864
00110010100000000011100001111000

irb(main):002:0> p3.simhash
=> 984968088
00111010101101010110101110011000
in this case we can see the hamming distance of the similar items (p1,p2)=4
whereas (p1,p3)=16 and (p2,p3)=12

how?

the simhash of a phrase is calculated as follow..

  1. pick a hashsize, lets say 32 bits
  2. let V = [0] * 32 # (ie 32 zeros)
  3. break the phrase up into features
    irb(main):003:0> 'the cat sat on the mat'.shingles
    => #<Set: {"th", "he", "e ", " c", "ca", "at", "t ",
        " s", "sa", " o", "on", "n ", " t", " m", "ma"}>
    
  4. hash each feature using a normal 32-bit hash algorithm
    "th".hash = -502157718
    "he".hash = -369049682
    ...
    
  5. for each hash
    if biti of hash is set then add 1 to V[i]
    if biti of hash is not set then take 1 from V[i]
  6. simhash biti is 1 if V[i] > 0 and 0 otherwise

who cares?

simhash is useful because if the simhash bitwise hamming distance of two phrases is low then their jaccard coefficient is high

but how can we make use of this to try to find the similiar items?

consider this other factoid....
in the case that two numbers have a low bitwise hamming distance
and the difference in their bits are in the lower order bits
then it turns out that they will end up close to each other if the list is sorted.

consider numbers

1 37586	1001001011010010 
2 50086	1100001110100110 7 <--(this column lists hamming
3 2648	0000101001011000 11    distance to previous entry)
4 934   0000001110100110 9
5 40957	1001111111111101 9
6 2650	0000101001011010 9
7 64475	1111101111011011 7
8 40955	1001111111111011 4

if we sort them

4 934   0000001110100110 
3 2648	0000101001011000 9
6 2650	0000101001011010 1
1 37586	1001001011010010 5
8 40955	1001111111111011 6
5 40957	1001111111111101 2
2 50086	1100001110100110 9
7 64475	1111101111011011 9
then we notice that two pairs with very smallest hamming distance
hdist(3,6)=1 and hdist(8,5)=2
have ended up adjacent to each other.

great! so in this case rather than check every combo we could just check the adjacent pairs of the list, each is a good candidate.

this has reduces the runtime
from n*(n-1)/2 coeff calcs O(n2)
to n fingerprints calcs O(n) + a sort O(n logn) + n coeff calcs O(n) ( which is O(n logn) overall )

alas there is another pair with a low hamming distance, hdist(4,2)=2 that have ended up totally apart at other ends of the list...

sorting only picked up the pairs that differed in their lower order bits.

to get around this consider another convenient property of bitwise hamming distance a permutation of the bits of two numbers preserves hamming distance..

if we permutate by 'rotating' the bits, ie bit shift left and replace lowest order bit with the 'lost' highest order bit we get 'new' fingerprints that have the same hamming distances

'rotate' bits left twice
4 3736	0000111010011000 
3 10592	0010100101100000 9
6 10600	0010100101101000 1
1 19274	0100101101001010 5
8 32750	0111111111101110 6
5 32758	0111111111110110 2
2 3739	0000111010011011 9
7 61295	1110111101101111 9

if we sort again by fingerprint

4 3736	0000111010011000 
2 3739	0000111010011011 2
3 10592	0010100101100000 11
6 10600	0010100101101000 1
1 19274	0100101101001010 5
8 32750	0111111111101110 6
5 32758	0111111111110110 2
7 61295	1110111101101111 6
yay! this time the (2,4) pair ended up adjacent
we also identified the (3,6) and (5,8) pairs as candidates again

so to avoid whether the items differ in the higher or lower order bits we can just so the following B times (where B is the bit length of the fingerprint)

  1. rotate the bits
  2. sort
  3. check adjacent

in the likely case that B << n the entire algorithm will still be under O(n2)

(in fact we wouldn't check each adjacent pair each rotation, rather collect the adjacent pairs in a set and check at the end, very similiar items would end up adjacent in a number of the rotations but only need to be checked once)

but it all hinges on the fact that similiar items end up next to each other in the sorted lists...timetime

experiments

lets see how well it does for finding similarities above 0.6
for simhash we give the runtime and it's accuracy in finding solutions
eg 5s (90%) implies simhash took 5s to find 90% of all possible resemblances under 0.6

entries shingling simhash16 simhash32
100 1s 1s (69%) 1s (69%)
200 4s 1s (61%) 2s (67%)
400 16s 2s (62%) 3s (65%)
800 1m 6s 3s (41%) 6s (54%)
1600 4m 30s 10s (31%) 11s (31%)

simhash observations

  • it's much faster, furthermore simhashs runtime growth is (just) sublinear, not exponential
  • 32bit performs better than 16bit, with only a little extra runtime required
  • it's getting more and more inaccurate... is this a function of just trying adjacents? perhaps. sampling a couple of cases where it missed some it would have found them if only trying a window of the next 5 instead of the next 1

if we change the code to consider the next 10 instead of just the next 1 we get....

entries shingling simhash16 simhash32
100 1s 2s (100%) 1s (100%)
200 4s 3s (91%) 5s (96%)
400 16s 6s (88%) 10s (93%)
800 1m 6s 13s (79%) 22s (83%)
1600 4m 30s 27s (67%) 48s (75%)
3200 21m 30s 58s (50%) 1m 43s (61%)

a x4 runtime for x2 the accuracy, seems worth it.
but it appears the window is going to have to get bigger and bigger as the dataset grows. this idea of a window size being some magic function of the dataset smells like a problem in the algorithm

an important thing to notice is that though the accuracy is dropping off it's the items with a lower similarity that not being found. with a fixed window size of 10 entries we can see that the best precision is in the high similarities

#entries >0.9 >0.8 >0.7 >0.6
400 9/9 (100%) 40/40 (100%) 65/65 (100%) 93/100 (93%)
800 19/19 (100%) 77/77 (100%) 143/147 (97%) 190/227 (83%)
1600 40/42 (95%) 133/139 (95%) 234/255 (91%) 307/406 (75%)
3200 64/75 (85%) 218/265 (82%) 397/546 (72%) 614/1003 (61%)

continuing to look at the 0.6 values in bigger sets

entries shingling (c++) simhash32
6,400 13s 5m (48%)
12,800 1m 8m 10s (42%)
25,600 4m 31s 17m 22s (37%)
51,200 19m 45m (31%)
102,400 1h 31m 1h 53m (27%)

so it seems the single threaded ruby simhash is juuuust about to overtake the multi threaded c++ brute force implementation but alas simhash is dying with out of memory.

there are a number of things i could do to reduce consumption but i've noticed that even with the rather large window size simhash32 is getting pretty terrible results for similarities over 0.6...

how about increasing the hash size from 32 to 64 bits? alas the ruby default hashing is only giving 32bit values so we need to use a larger custom hashing function.

how does swapping in a 64bit universal hash help? it takes quite a bit longer (somewhat expected; there are twice as many bits to rotate and the hash function is now implemented in ruby instead of using ruby's c hash impl) but seems more accurate...

entries shingling c++ simhash16 simhash32 usimhash64
800 1s 13s (79%) 22s (83%) 72s (100%)
1600 3s 27s (67%) 48s (75%) 2m 52s (97%)
3200 5s 58s (50%) 1m 43s (61%) 6m 38s (92%)
6400 14s   5m (48%) 19m 29s (83%)

and the behaviour is as before, it has captured the higher level of similarities more than the lower similarities

usimhash64 >0.9 >0.8 >0.7 >0.6
6400 100% 97% 92% 83%

the universal hash has another useful property. since it's seeded from a random value it can be run a number of times with different results. so what we can do is perform a number of independant runs in parallel on a multicore machine. this roughly makes the execution time 4 times faster (on 4 core box) bringing it back down to runtime required for simhash32

how's about trying a sketching algorithm?

你可能感兴趣的:(Algorithm,Engine,search)