shingling gives great results but the O(n2) runtime is poor
a set of 1e6 records would require 5e11 comparisons and even the cpp impl can "only" do 5e6 /sec
that's 2 months of runtime, 1.999 months too long in my mind.
we need an heuristic that can help us avoid all that brute forceness.....
let's try making use of the hash algorithm simhash.
simhash was developed by Moses Charikar and is described in his paper
similarity estimation techniques from rounding algorithms.pdf
so what is so special about simhash?
a hash function usually hashes different values to totally different hash values
irb(main):006:0> p1 = 'the cat sat on the mat' irb(main):005:0> p2 = 'the cat sat on a mat' irb(main):007:0> p3 = 'we all scream for ice cream' irb(main):007:0> p1.hash => 415542861 irb(main):007:0> p2.hash => 668720516 irb(main):007:0> p3.hash => 767429688
simhash is one where similiar items are hashed to similiar hash values
(by similar we mean the bitwise hamming distance between hash values)
irb(main):003:0> p1.simhash => 851459198 00110010110000000011110001111110 irb(main):004:0> p2.simhash => 847263864 00110010100000000011100001111000 irb(main):002:0> p3.simhash => 984968088 00111010101101010110101110011000in this case we can see the hamming distance of the similar items (p1,p2)=4
the simhash of a phrase is calculated as follow..
irb(main):003:0> 'the cat sat on the mat'.shingles => #<Set: {"th", "he", "e ", " c", "ca", "at", "t ", " s", "sa", " o", "on", "n ", " t", " m", "ma"}>
"th".hash = -502157718 "he".hash = -369049682 ...
simhash is useful because if the simhash bitwise hamming distance of two phrases is low then their jaccard coefficient is high
but how can we make use of this to try to find the similiar items?
consider this other factoid....
in the case that two numbers have a low bitwise hamming distance
and the difference in their bits are in the lower order bits
then it turns out that they will end up close to each other if the list is sorted.
consider numbers
1 37586 1001001011010010 2 50086 1100001110100110 7 <--(this column lists hamming 3 2648 0000101001011000 11 distance to previous entry) 4 934 0000001110100110 9 5 40957 1001111111111101 9 6 2650 0000101001011010 9 7 64475 1111101111011011 7 8 40955 1001111111111011 4
if we sort them
4 934 0000001110100110 3 2648 0000101001011000 9 6 2650 0000101001011010 1 1 37586 1001001011010010 5 8 40955 1001111111111011 6 5 40957 1001111111111101 2 2 50086 1100001110100110 9 7 64475 1111101111011011 9then we notice that two pairs with very smallest hamming distance
great! so in this case rather than check every combo we could just check the adjacent pairs of the list, each is a good candidate.
this has reduces the runtime
from n*(n-1)/2 coeff calcs O(n2)
to n fingerprints calcs O(n) + a sort O(n logn) + n coeff calcs O(n) ( which is O(n logn) overall )
alas there is another pair with a low hamming distance, hdist(4,2)=2 that have ended up totally apart at other ends of the list...
sorting only picked up the pairs that differed in their lower order bits.
to get around this consider another convenient property of bitwise hamming distance a permutation of the bits of two numbers preserves hamming distance..
if we permutate by 'rotating' the bits, ie bit shift left and replace lowest order bit with the 'lost' highest order bit we get 'new' fingerprints that have the same hamming distances
'rotate' bits left twice 4 3736 0000111010011000 3 10592 0010100101100000 9 6 10600 0010100101101000 1 1 19274 0100101101001010 5 8 32750 0111111111101110 6 5 32758 0111111111110110 2 2 3739 0000111010011011 9 7 61295 1110111101101111 9
if we sort again by fingerprint
4 3736 0000111010011000 2 3739 0000111010011011 2 3 10592 0010100101100000 11 6 10600 0010100101101000 1 1 19274 0100101101001010 5 8 32750 0111111111101110 6 5 32758 0111111111110110 2 7 61295 1110111101101111 6yay! this time the (2,4) pair ended up adjacent
so to avoid whether the items differ in the higher or lower order bits we can just so the following B times (where B is the bit length of the fingerprint)
in the likely case that B << n the entire algorithm will still be under O(n2)
(in fact we wouldn't check each adjacent pair each rotation, rather collect the adjacent pairs in a set and check at the end, very similiar items would end up adjacent in a number of the rotations but only need to be checked once)
but it all hinges on the fact that similiar items end up next to each other in the sorted lists...timetime
lets see how well it does for finding similarities above 0.6
for simhash we give the runtime and it's accuracy in finding solutions
eg 5s (90%) implies simhash took 5s to find 90% of all possible resemblances under 0.6
entries | shingling | simhash16 | simhash32 |
100 | 1s | 1s (69%) | 1s (69%) |
200 | 4s | 1s (61%) | 2s (67%) |
400 | 16s | 2s (62%) | 3s (65%) |
800 | 1m 6s | 3s (41%) | 6s (54%) |
1600 | 4m 30s | 10s (31%) | 11s (31%) |
simhash observations
if we change the code to consider the next 10 instead of just the next 1 we get....
entries | shingling | simhash16 | simhash32 |
100 | 1s | 2s (100%) | 1s (100%) |
200 | 4s | 3s (91%) | 5s (96%) |
400 | 16s | 6s (88%) | 10s (93%) |
800 | 1m 6s | 13s (79%) | 22s (83%) |
1600 | 4m 30s | 27s (67%) | 48s (75%) |
3200 | 21m 30s | 58s (50%) | 1m 43s (61%) |
a x4 runtime for x2 the accuracy, seems worth it.
but it appears the window is going to have to get bigger and bigger as the dataset grows. this idea of a window size being some magic function of the dataset smells like a problem in the algorithm
an important thing to notice is that though the accuracy is dropping off it's the items with a lower similarity that not being found. with a fixed window size of 10 entries we can see that the best precision is in the high similarities
#entries | >0.9 | >0.8 | >0.7 | >0.6 |
400 | 9/9 (100%) | 40/40 (100%) | 65/65 (100%) | 93/100 (93%) |
800 | 19/19 (100%) | 77/77 (100%) | 143/147 (97%) | 190/227 (83%) |
1600 | 40/42 (95%) | 133/139 (95%) | 234/255 (91%) | 307/406 (75%) |
3200 | 64/75 (85%) | 218/265 (82%) | 397/546 (72%) | 614/1003 (61%) |
continuing to look at the 0.6 values in bigger sets
entries | shingling (c++) | simhash32 |
6,400 | 13s | 5m (48%) |
12,800 | 1m | 8m 10s (42%) |
25,600 | 4m 31s | 17m 22s (37%) |
51,200 | 19m | 45m (31%) |
102,400 | 1h 31m | 1h 53m (27%) |
so it seems the single threaded ruby simhash is juuuust about to overtake the multi threaded c++ brute force implementation but alas simhash is dying with out of memory.
there are a number of things i could do to reduce consumption but i've noticed that even with the rather large window size simhash32 is getting pretty terrible results for similarities over 0.6...
how about increasing the hash size from 32 to 64 bits? alas the ruby default hashing is only giving 32bit values so we need to use a larger custom hashing function.
how does swapping in a 64bit universal hash help? it takes quite a bit longer (somewhat expected; there are twice as many bits to rotate and the hash function is now implemented in ruby instead of using ruby's c hash impl) but seems more accurate...
entries | shingling c++ | simhash16 | simhash32 | usimhash64 |
800 | 1s | 13s (79%) | 22s (83%) | 72s (100%) |
1600 | 3s | 27s (67%) | 48s (75%) | 2m 52s (97%) |
3200 | 5s | 58s (50%) | 1m 43s (61%) | 6m 38s (92%) |
6400 | 14s | 5m (48%) | 19m 29s (83%) |
and the behaviour is as before, it has captured the higher level of similarities more than the lower similarities
usimhash64 | >0.9 | >0.8 | >0.7 | >0.6 |
6400 | 100% | 97% | 92% | 83% |
the universal hash has another useful property. since it's seeded from a random value it can be run a number of times with different results. so what we can do is perform a number of independant runs in parallel on a multicore machine. this roughly makes the execution time 4 times faster (on 4 core box) bringing it back down to runtime required for simhash32
how's about trying a sketching algorithm?